This library is the product of my explorations into succinct data structures for data visualization. It implements several bit vector types with rank and select operations, each specialized to a particular data pattern:
- The dense bit vector is based on the approach proposed in Fast, Small, Simple Rank/Select on Bitmaps.
- The sparse bit vector is implemented using Elias-Fano encoding.
- The run-length-encoded bit vector uses an idea I developed together with Gonzalo Navarro, which is a tweak to a RLE bit vector described in his book to improve the efficiency of rank queries. We wrote a technical note about it.
- The array bit vector is backed by a plain array, and serves as a useful baseline for correctness testing.
In addition, there are also a few wrapper types that grant new powers to an existing bit vector:
- Multi<T> wraps a plain bit vector and turns it into a "multi bit vector" which stores counts and allows multiple repetitions of the same bit to be stored (if a bit vector is a set, a multi bit vector is a multiset).
- ZeroPadded<T> embeds a bit vector into a larger universe by padding it on the left and right with 0-bits. These extra bits take no space as they are represented implicitly.
The crown jewel of this crate is the wavelet matrix data structure, which generalizes rank/select bit vectors to larger integer alphabets than (0, 1). (The wavelet matrix is a variant of the wavelet tree that improves space efficiency and runtime performance, particularly for large alphabets.)
This crate provides wavelet matrix rank and select, as well as a few other useful operations like range count and range quantile.
Storing Morton codes in the wavelet matrix enables multi-dimensional range count queries, which is great fun for zoomable density scatterplots with adaptive level-of-detail.
This crate defines several traits:
- BitVec is implemented by plain bit vectors (which are functionally integer sets)
- MultiBitVec is implemented by bit vector types that support storing the same integer multiple times.
ArrayBitVec
andSparseBitVec
store repetitions explicitly (each copy takes more space), whileMulti<T>
encodes multiplicities in a separate bit vector of counts, so it can store large counts efficiently. - BitVecBuilder and MultiBitVecBuilder are builder traits corresponding to the two traits above.
These traits enable writing code that is parametric over any particular bit vector type. For convenience, the builders have access to their target bit vector type as an associated type, and the bit vectors similarly have access to their builder type, which helped greatly when writing parametric test functions and enabled reusing test code across all concrete implementations of these traits.
Each builder type also contains an associated type describing the valid configuration options for its bit vector type, which turned out to be a nice way to enable customizability while maintaining a coherent interface.
To turn a MultiBitVec into a BitVec, see BitVecOf.
This package provides experimental work-in-progress WebAssembly bindings to all of its bit vectors as well as the wavelet matrix, implemented in js.rs
. The bindings use another package I wrote, to_js, which implements basic Rust–JS bindings in around 750 lines of Rust. I didn't have a concrete use for the WebAssembly bindings when I was implementing this package so they're in a bit a proof of concept phase at the moment (but they do work!)
The library currently contains a collection of tests for the core traits (and through them for all of the concrete bit vector types), as well as exhaustive tests for small universes using Exhaustigen, and some basic property tests using arbtest.
- Add Huffman-compressed wavelet matrix construction and processing. (The JS library in this repository implements some of this)
- Add compressed bit vectors as described in Fast, Small, Simple Rank/Select on Bitmaps
- Add quad vectors and the quad wavelet matrix. Explore its use for two-dimensional range queries without the need for Morton masks.
- Paper: Faster wavelet trees with quad vectors
- Paper: Faster Wavelet Tree Queries
- Code for an existing QWT implementation
- More testing
- Add more tests for rank1_batch, which is currently only spot-tested, and for the wavelet matrix, whose full functionality is not covered by the existing spot tests
- Add tests for the individual bit vectors that test the particular patterns each type is specialized for, and test each bit vector's range of configuration options.
- various numbers of runs and run-lengths for rle, verifying the space savings
- large universes and varying split points for sparse
- varied densities and sampling options for dense