
Conversation

@itsibitzi (Contributor)

No description provided.

@itsibitzi itsibitzi marked this pull request as ready for review January 16, 2026 00:33
@itsibitzi itsibitzi requested a review from a team as a code owner January 16, 2026 00:33
Copilot AI review requested due to automatic review settings January 16, 2026 00:33
Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces HRIBLT (Hierarchical Rateless Bloom Lookup Tables), a novel set reconciliation algorithm that computes the symmetric difference between sets with data transfer proportional to the size of the difference rather than the overall set size. The implementation includes encoding and decoding sessions, comprehensive test coverage, and documentation.
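
For orientation, here is a minimal usage sketch assembled from calls that appear later in this review (`DefaultHashFunctions`, `EncodingSession::new`, `extend`); the merge/decode half of the flow is only sketched in comments, since those calls are discussed further down:

```rust
use hriblt::{DefaultHashFunctions, EncodingSession};

fn main() {
    // Each peer encodes its own set into the same range of coded symbols,
    // using the same hash functions (and, per a comment below, the same seed).
    let set_a: Vec<u64> = (0..1_000).collect();
    let mut encoder_a = EncodingSession::new(DefaultHashFunctions, 0..256);
    encoder_a.extend(set_a.iter().copied());

    // The peers then exchange coded symbols; combining the two streams and
    // feeding them to a DecodingSession recovers the symmetric difference.
}
```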

Changes:

  • Added a new hriblt crate implementing a rateless set reconciliation algorithm using XOR-based hash functions
  • Includes core algorithm implementation with encoding/decoding sessions, error types, and public API traits
  • Added comprehensive test suite with performance benchmarking and documentation including README and usage guides

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
| --- | --- |
| crates/hriblt/Cargo.toml | Package definition for the new hriblt crate |
| crates/hriblt/README.md | Usage documentation with example code |
| crates/hriblt/src/lib.rs | Core algorithm implementation with hash functions and utilities |
| crates/hriblt/src/error.rs | Error types for set reconciliation operations |
| crates/hriblt/src/encoding_session.rs | Encoding session for creating coded symbols from sets |
| crates/hriblt/src/decoding_session.rs | Decoding session for recovering differences from coded symbols |
| crates/hriblt/src/decoded_value.rs | Enum representing decoded additions or deletions |
| crates/hriblt/src/coded_symbol.rs | Coded symbol structure for the invertible bloom filter |
| crates/hriblt/docs/sizing.md | Guide for sizing HRIBLT sessions |
| crates/hriblt/docs/hashing_functions.md | Documentation on hash function implementation |
| crates/hriblt/docs/assets/coded-symbol-multiplier.png | Chart showing coded symbol efficiency |
| Cargo.toml | Updated workspace member formatting |

Comments suppressed due to low confidence (2)

crates/hriblt/src/encoding_session.rs:64

  • Spelling error: "te vector" should be "the vector".
    /// Returns an error if the split is out of range or if the length of te vector

crates/hriblt/docs/hashing_functions.md:11

  • Grammatical error: "the hashes produces" should be "the hashes produced".
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.


```rust
if split < range.start || split > range.end {
    return Err(SetReconciliationError::SplitOutOfRange);
}
if coded_symbols.len() > range.len() {
```

Copilot AI (Jan 16, 2026):

Inconsistent validation logic: line 53 uses assert_eq!(coded_symbols.len(), range.len()) to ensure exact equality, but line 75 only checks if coded_symbols.len() > range.len(). The validation in try_from_coded_symbols should check for inequality (!=) rather than just greater-than (>) to match the behavior of from_coded_symbols and properly reject cases where the vector is too short.

Suggested change:

```diff
-if coded_symbols.len() > range.len() {
+if coded_symbols.len() != range.len() {
```

Comment on lines +93 to +94

```rust
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
```

Copilot AI (Jan 16, 2026):

The extend method should accept impl IntoIterator<Item = T> instead of impl Iterator<Item = T> to follow Rust's standard library conventions. This would allow callers to pass ranges, vectors, or other collections directly without needing to call .into_iter() first. The current signature forces unnecessary verbosity (e.g., requiring .iter().cloned() in line 225 of lib.rs).

Suggested change:

```diff
-pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
-    for entity in entities {
+pub fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
+    for entity in entities.into_iter() {
```


This library has a trait, `HashFunctions` which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.

Copilot AI (Jan 16, 2026):

Documentation reference to a non-existent file: The documentation mentions "How and why this is done is explained in the overview.md documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change:

```diff
-The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
+The following documentation provides more details on this trait in particular.
```

```rust
let mut bits = Stats::default();
let mut encoding_time = Stats::default();
let mut decoding_time = Stats::default();
let mut deocding_time_fast = Stats::default();
```

Copilot AI (Jan 16, 2026):

Spelling error: "deocding_time_fast" should be "decoding_time_fast" (typo in variable name).

```rust
/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
```

Copilot AI (Jan 16, 2026):

Incorrect or outdated documentation: The comment states "The second return value indicates whether the entry should be stored negated" but the function only returns a single usize value, not a tuple. This documentation should be removed or corrected.

Suggested change:

```diff
-/// The second return value indicates whether the entry should be stored negated.
```


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)

Collaborator:

We also need to visualize the standard deviation (or maybe percentiles like 90%-ile, 99%-ile). Max doesn't really make sense, since in theory it is infinite.

Those are almost more important than the average.

Collaborator:

mmm. I couldn't find how you generated this file... We should/must merge that code.

```rust
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
/// - The indices of the value must match the index of this coded symbol.
```

Collaborator:

(exactly) one of the indices of the value must match the index of this coded symbol.
Note: in theory it would be possible to also accept coded symbols which got added an odd number of times to the same bucket.
But this only makes a difference when the total number of buckets is <= 32 which is not really a performance critical case.

```rust
if stream_len > 32 && i % 32 != 0 {
    let seed = i % 4;
    let j = index_for_seed(state, value, stream_len, seed as u32);
    if i == j { 1 } else { 0 }
```

Collaborator:

Should this be written differently in Rust? (same couple lines below)

@itsibitzi (Contributor, author):

https://godbolt.org/z/6bdEdKP54

I made the above to see if LLVM would keep the branch. It looks like `foo` and `bar` both get compiled to the same instructions at -O3, to the point where the compiler only keeps `bar`?

I've moved it to use `i32::from(i == j)` since this is less ambiguous for the compiler.
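
For reference, a self-contained illustration of the two equivalent forms (the function names here are made up for the example):

```rust
// Two ways to turn a comparison into an integer contribution. At -O3,
// LLVM compiles both to the same branch-free instructions; `i32::from`
// simply states the intent explicitly.
fn with_branch(i: usize, j: usize) -> i32 {
    if i == j { 1 } else { 0 }
}

fn with_from(i: usize, j: usize) -> i32 {
    i32::from(i == j)
}

fn main() {
    assert_eq!(with_branch(3, 3), with_from(3, 3));
    assert_eq!(with_branch(3, 4), with_from(3, 4));
}
```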


```rust
/// For statistical purposes: this number informs how many bits of the
/// checksum were required to identify pure coded symbols.
pub(crate) required_bits: usize,
```

Collaborator:

Wondering whether we still want/need this...
Essentially, the smaller this number, the more robust the procedure.

```rust
/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
encoded: EncodingSession<T, H>,
```

Collaborator:

We need to add somewhere that all coded symbols must be computed with the same seed.


```rust
/// Create a EncodingSession from a vector of coded symbols.
///
/// Returns an error if the split is out of range or if the length of te vector
```

Collaborator:

Suggested change:

```diff
-/// Returns an error if the split is out of range or if the length of te vector
+/// Returns an error if the split is out of range or if the length of the vector
```

```rust
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
//! by testing whether one of the 5 hash functions would map the candidate value back to the index
//! of the value. Taking advantage of this necessary condition reduces the number of necessary checksum
//! bits by a bit.
```

Collaborator:

"a bit"... i.e. one bit?
This is ambiguous ;)

```rust
//! This algorithm has similar properties to the "Practical Rateless Set Reconciliation" algorithm.
//! Main differences are:
//! * We only use 5 hash functions instead of log(n).
//! * As a result we only require 5 independent hash functions instead of log(n) many.
```

Collaborator:

Actually, I don't recall whether they need log(n) *independent* hash functions...
It might be that independence is not required in their scheme?

```rust
//! and a stable math library must be used!)
//! * Encoding/decoding is faster due to the fixed number of hash functions and the simpler operations.
//! * Since we have a fixed number of hash functions, we can utilize the coded symbol index as
//!   an additional condition. In fact, we need to compute just a single hash function (on average).
```

Collaborator:

Suggested change:

```diff
-//! an additional condition. In fact, we need to compute just a single hash function (on average).
+//! an additional condition. In fact, we need to compute just a single hash function (on average) for this check.
```

```rust
    }
}

fn indices<T>(
```

Collaborator:

should this function be moved into a separate file, since it isn't public?

Collaborator:

maybe some index.rs or bucket_index.rs file?

```rust
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
fn index_for_seed<T>(
```

Collaborator:

same for these...

```rust
/// - Fewer (ideally 3) hash functions with larger partitions lead to a higher chance to
///   find a pure value in the stream, i.e. the stream can be decoded with fewer coded symbols.
///
/// After testing various schemes, I settled for this one which uses 4 equally sized partitions,
```

Collaborator:

"I" => ... (rephrase to avoid the first person in the doc comment)

```rust
/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
```

Collaborator:

remove this last sentence, since there isn't a second return value anymore ;)


```rust
diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator:

You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?
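
For example, something like this (the `DecodedValue` variant names are a guess from the enum's description in this PR; the real names may differ):

```rust
// Hypothetical: map each decoded entry to a signed value so the README
// example shows the direction of the difference, not just membership.
let signed_diff: Vec<i64> = decoded
    .into_iter()
    .map(|value| match value {
        DecodedValue::Added(x) => x as i64,      // only in the other set
        DecodedValue::Removed(x) => -(x as i64), // only in our set
    })
    .collect();
```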


```rust
assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator:

I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.
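
Loosely, something of this shape (stitched together from `EncodingSession::new`, `extend`, and `DecodingSession::from_encoding`/`append` quoted elsewhere in this review; the session-merging step, the `first_chunk`/`remaining_chunks` helpers, and the `is_decoded` check are assumptions):

```rust
// Both sides encode the same range of 1024 coded symbols, with the same
// hash functions and (per a comment above) the same seed.
let mut encoder = EncodingSession::new(DefaultHashFunctions, 0..1024);
encoder.extend(local_set.iter().copied());

// After exchanging and combining coded symbols, the receiver decodes the
// first 128 symbols (from_encoding requires the range to start at 0),
// then appends 128 more at a time until decoding succeeds.
let mut session = DecodingSession::from_encoding(first_chunk); // range 0..128
for chunk in remaining_chunks {
    if session.is_decoded() { // status check assumed
        break;
    }
    session.append(chunk); // feed the next 128 coded symbols
}
```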

```toml
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
```

Collaborator:

we also want some benchmarks to show throughput


If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example if you were trying to find the difference between two sets of documents, instead of each coded symbol being the whole document it could instead just be a SHA1 hash of the document content. Since each SHA1 digest has 20 bytes of high entropy bits, instead of hashing this value five times again to produce the five coded symbol indices we can simply slice out five `u32` values from the digest itself.
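
A minimal sketch of that slicing, assuming a raw 20-byte digest; the function name is made up for illustration, and the wiring into the `HashFunctions` trait is not shown:

```rust
/// Reinterpret a 20-byte SHA1 digest as five u32 index seeds instead of
/// rehashing the value: 20 bytes / 4 bytes per u32 = exactly five values.
fn indices_from_digest(digest: &[u8; 20]) -> [u32; 5] {
    let mut out = [0u32; 5];
    for (i, chunk) in digest.chunks_exact(4).enumerate() {
        // Little-endian is an arbitrary but fixed choice; any stable byte
        // order works as long as both peers agree on it.
        out[i] = u32::from_le_bytes(chunk.try_into().unwrap());
    }
    out
}

fn main() {
    let digest = [0x5au8; 20]; // stand-in for a real SHA1 digest
    assert_eq!(indices_from_digest(&digest), [0x5a5a5a5a; 5]);
}
```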

Collaborator:

I think we should have a feature flag which provides the SHA1 functionality for you...

@@ -0,0 +1,400 @@

```rust
//! This module implements a set reconciliation algorithm using XOR-based hashes.
//!
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
```

Collaborator:

I don't think we ever talk about set similarity sketching anywhere else.
Dropping this term here is therefore confusing IMO.
(I'm also not quite sure what the relation is except that both procedures use some kind of hashing and an array :) )


When using HRIBLT in production systems it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.

Collaborator:

Suggested change:

```diff
-We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
+We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produced by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
```


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](../evaulation/overhead/overhead.png)

Collaborator:

looks like the file path got messed up :)
evaulation


![Coded symbol multiplier](../evaulation/overhead/overhead.png)

For small diffs, the number of coded symbols required per value is larger, after a difference of approximately 100 values the coefficient settles on around 1.3 to 1.4.

Collaborator:

the graph should probably be logarithmic in the diff size to show that the variance goes down with larger n!
I.e. for large diff sizes (>10000) the variance is almost completely gone.

Comment on lines +142 to +177

```rust
// Generate base set of random u64 values
let base_set: HashSet<u64> = (0..set_size).map(|_| rng.random()).collect();

// Create set A as the base set
let set_a: Vec<u64> = base_set.iter().copied().collect();

// Create set B by removing some elements and adding new ones
let mut set_b: HashSet<u64> = base_set.clone();

// Remove diff_size/2 elements from set B
let removals = diff_size / 2;
let additions = diff_size - removals;

let mut to_remove: Vec<u64> = set_b.iter().copied().collect();
to_remove.shuffle(rng);
for val in to_remove.into_iter().take(removals as usize) {
    set_b.remove(&val);
}

// Add diff_size - removals new elements to set B
for _ in 0..additions {
    loop {
        let new_val: u64 = rng.random();
        if !base_set.contains(&new_val) && set_b.insert(new_val) {
            break;
        }
    }
}

let set_b: Vec<u64> = set_b.into_iter().collect();

// Create encoding sessions for both sets with max capacity
let state = DefaultHashFunctions;

let mut encoder_a = EncodingSession::new(state, 0..max_symbols);
encoder_a.extend(set_a.iter().copied());
```

Collaborator:

this whole setup is crazily inefficient especially if you want to compute it for ALL diff sizes within a range.
We should at least change this code to use exponentially growing diff sizes.
On top, one doesn't need TWO sets... it is totally fine to use a single set with n elements.
If you want to be fancy, then you can insert some of the n elements as insertions and the others as deletions.

With this procedure, one can simply compute the next encoding session by adding another (distinct) element to it. The distinct aspect can be achieved by starting off with a random set of unique integers (in random order). At every x-th integer addition/removal you can then try to decode the set.
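
A sketch of that procedure (reusing `EncodingSession::new`, `insert`, and `shuffle` from the code above; `try_decode` is a placeholder for the decode attempt):

```rust
// One session, grown one distinct element at a time; decoding is attempted
// at exponentially spaced checkpoints instead of rebuilding sets per size.
let mut unique: Vec<u64> = (0..max_diff).collect();
unique.shuffle(&mut rng); // random order, guaranteed distinct

let mut session = EncodingSession::new(DefaultHashFunctions, 0..max_symbols);
let mut checkpoint = 1;
for (i, value) in unique.into_iter().enumerate() {
    session.insert(value);
    if i + 1 == checkpoint {
        try_decode(&session); // placeholder: attempt decoding here
        checkpoint *= 2; // exponentially growing diff sizes
    }
}
```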

```rust
}

// Grow by 10%, but at least 1
let growth = (current_symbols / 10).max(1);
```

Collaborator:

It "should" be fine to go here in steps of one.


```rust
if chunk_end > chunk_start {
    let chunk = merged.split_off(chunk_end - chunk_start);
    decoding_session.append(chunk);
```

Collaborator:

You could avoid the whole loop by letting the append function return how many coded symbols were required to decode the session...
(You might still want to keep the loop, since it can avoid copying data unnecessarily. In that case, you probably want to copy at least e.g. 32 elements at a time, or 10%, whichever is larger.)

```rust
let chunk_end = current_symbols.min(max_symbols);

if chunk_end > chunk_start {
    let chunk = merged.split_off(chunk_end - chunk_start);
```

Collaborator:

In the scheme I described above, you don't want to split_off the next range, but rather create a copy of a range... This might also be what VSC wants to use to reuse data across multiple round-trips.

```rust
eprintln!("Results:");
eprintln!("  Successes: {}/{} ({:.1}%)", successes, args.trials,
    100.0 * successes as f64 / args.trials as f64);
eprintln!("  Failures: {}/{} ({:.1}%)", failures, args.trials,
```

Collaborator:

have there been failures?


```rust
for i in 0..args.trials {
    let diff_size = diff_iter.next_diff_size(&mut *rng);
    let max_symbols = (diff_size * args.multiplier) as usize;
```

Collaborator:

you actually want some minimum and a multiplier.
The multiplier should be something like 2 and the minimum should be something like 1000 or so.


### Running the Evaluation

1. Ensure R and ImageMagick are installed with necessary packages:

Collaborator:

Question: Copilot generated a graph for me with the `plotters` crate.
Should we use that instead?

Typical results show:
- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x
- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols

Collaborator:

10x the diff size isn't a great outcome :)
Mention here the min trick I described above (and which we use in production)

Collaborator:

I would actually expect that we need AT MOST 50% overhead after some diff size.

Comment on lines +87 to +98

```rust
/// This is a faster version for decoding the initial stream.
/// It processes this stream from back to front without going through the hierarchical representation.
/// The other procedure needs to execute roughly one additional `is_pure` test when unrolling the hierarchy
/// which this procedure avoids.
/// Additionally, this procedure can save on average another 50% of is_pure tests, since it won't waste time
/// on the highly packed hierarchy levels where we don't expect to find any pure values.
///
/// Panics if the encoding session is not the beginning of a stream (e.g. the range is `0..n`)
pub fn from_encoding(merged: EncodingSession<T, H>) -> Self {
    assert_eq!(merged.range.start, 0);
    Self::from_encoding_unchecked(merged)
}
```

Collaborator:

would be nice to have a speed benchmark + graph...

```rust
if self.non_zero == 0 {
    // At this point everything should be decoded...
    // We could in theory check that all remaining coded symbols are zero.
    break;
```

Collaborator:

Here you could return `i`, which indicates how many symbols were required for decoding.

```rust
}

/// Adds an entity to the encoding session.
pub fn insert(&mut self, entity: T) {
```

Collaborator:

for the benchmark, it would be nice to have a remove function so that you don't have to go through 2 encoding sessions.


```rust
/// This test ensures that the parent and hash_to_index functions are consistent to each other!
#[test]
fn test_parent() {
```

Collaborator:

some of these tests should probably be moved to the files which actually implement that functionality.
