
Conversation

@itsibitzi (Contributor)

No description provided.

@itsibitzi itsibitzi marked this pull request as ready for review January 16, 2026 00:33
@itsibitzi itsibitzi requested a review from a team as a code owner January 16, 2026 00:33
Copilot AI review requested due to automatic review settings January 16, 2026 00:33
Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces HRIBLT (Hierarchical Rateless Bloom Lookup Tables), a novel set reconciliation algorithm that computes the symmetric difference between sets with data transfer proportional to the size of the difference rather than the overall set size. The implementation includes encoding and decoding sessions, comprehensive test coverage, and documentation.
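
For orientation, here is a minimal usage sketch assembled from calls that appear later in this review (`DefaultHashFunctions`, `EncodingSession::new`, `extend`); the merge/decode half of the flow is only sketched in comments, since those calls are discussed further down:

```rust
use hriblt::{DefaultHashFunctions, EncodingSession};

fn main() {
    // Each peer encodes its own set into the same range of coded symbols,
    // using the same hash functions (and, per a comment below, the same seed).
    let set_a: Vec<u64> = (0..1_000).collect();
    let mut encoder_a = EncodingSession::new(DefaultHashFunctions, 0..256);
    encoder_a.extend(set_a.iter().copied());

    // The peers then exchange coded symbols; combining the two streams and
    // feeding them to a DecodingSession recovers the symmetric difference.
}
```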

Changes:

  • Added a new hriblt crate implementing a rateless set reconciliation algorithm using XOR-based hash functions
  • Includes core algorithm implementation with encoding/decoding sessions, error types, and public API traits
  • Added comprehensive test suite with performance benchmarking and documentation including README and usage guides

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
| --- | --- |
| crates/hriblt/Cargo.toml | Package definition for the new hriblt crate |
| crates/hriblt/README.md | Usage documentation with example code |
| crates/hriblt/src/lib.rs | Core algorithm implementation with hash functions and utilities |
| crates/hriblt/src/error.rs | Error types for set reconciliation operations |
| crates/hriblt/src/encoding_session.rs | Encoding session for creating coded symbols from sets |
| crates/hriblt/src/decoding_session.rs | Decoding session for recovering differences from coded symbols |
| crates/hriblt/src/decoded_value.rs | Enum representing decoded additions or deletions |
| crates/hriblt/src/coded_symbol.rs | Coded symbol structure for the invertible bloom filter |
| crates/hriblt/docs/sizing.md | Guide for sizing HRIBLT sessions |
| crates/hriblt/docs/hashing_functions.md | Documentation on hash function implementation |
| crates/hriblt/docs/assets/coded-symbol-multiplier.png | Chart showing coded symbol efficiency |
| Cargo.toml | Updated workspace member formatting |

Comments suppressed due to low confidence (2)

crates/hriblt/src/encoding_session.rs:64

  • Spelling error: "te vector" should be "the vector".
    /// Returns an error if the split is out of range or if the length of te vector

crates/hriblt/docs/hashing_functions.md:11

  • Grammatical error: "the hashes produces" should be "the hashes produced".
We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.


```rust
if split < range.start || split > range.end {
    return Err(SetReconciliationError::SplitOutOfRange);
}
if coded_symbols.len() > range.len() {
```

Copilot AI (Jan 16, 2026):

Inconsistent validation logic: line 53 uses assert_eq!(coded_symbols.len(), range.len()) to ensure exact equality, but line 75 only checks if coded_symbols.len() > range.len(). The validation in try_from_coded_symbols should check for inequality (!=) rather than just greater-than (>) to match the behavior of from_coded_symbols and properly reject cases where the vector is too short.

Suggested change:

```diff
-if coded_symbols.len() > range.len() {
+if coded_symbols.len() != range.len() {
```

Comment on lines +93 to +94

```rust
pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
    for entity in entities {
```

Copilot AI (Jan 16, 2026):

The extend method should accept impl IntoIterator<Item = T> instead of impl Iterator<Item = T> to follow Rust's standard library conventions. This would allow callers to pass ranges, vectors, or other collections directly without needing to call .into_iter() first. The current signature forces unnecessary verbosity (e.g., requiring .iter().cloned() in line 225 of lib.rs).

Suggested change:

```diff
-pub fn extend(&mut self, entities: impl Iterator<Item = T>) {
-    for entity in entities {
+pub fn extend(&mut self, entities: impl IntoIterator<Item = T>) {
+    for entity in entities.into_iter() {
```


This library has a trait, `HashFunctions` which is used to create the hashes required to place your symbol into the range of coded symbols.

The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.

Copilot AI (Jan 16, 2026):

Documentation reference to a non-existent file: The documentation mentions "How and why this is done is explained in the overview.md documentation" but no such file exists in the docs directory. This reference should either be removed or the file should be created.

Suggested change:

```diff
-The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
+The following documentation provides more details on this trait in particular.
```

```rust
let mut bits = Stats::default();
let mut encoding_time = Stats::default();
let mut decoding_time = Stats::default();
let mut deocding_time_fast = Stats::default();
```

Copilot AI (Jan 16, 2026):

Spelling error: "deocding_time_fast" should be "decoding_time_fast" (typo in variable name).

```rust
/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
```

Copilot AI (Jan 16, 2026):

Incorrect or outdated documentation: The comment states "The second return value indicates whether the entry should be stored negated" but the function only returns a single usize value, not a tuple. This documentation should be removed or corrected.

Suggested change:

```diff
-/// The second return value indicates whether the entry should be stored negated.
```


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)

Collaborator:

We also need to visualize the standard deviation (or maybe percentiles like 90%-ile, 99%-ile). Max doesn't really make sense, since in theory it is infinite.

Those are almost more important than the average.

Collaborator:

mmm. I couldn't find how you generated this file... We should/must merge that code.

```rust
/// A pure coded symbol must satisfy the following conditions:
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
/// - The checksum must match the checksum of the value.
/// - The indices of the value must match the index of this coded symbol.
```

Collaborator:

(exactly) one of the indices of the value must match the index of this coded symbol.
Note: in theory it would be possible to also accept coded symbols which got added an odd number of times to the same bucket.
But this only makes a difference when the total number of buckets is <= 32 which is not really a performance critical case.

```rust
if stream_len > 32 && i % 32 != 0 {
    let seed = i % 4;
    let j = index_for_seed(state, value, stream_len, seed as u32);
    if i == j { 1 } else { 0 }
```

Collaborator:

Should this be written differently in Rust? (same couple lines below)

@itsibitzi (Contributor, author):

https://godbolt.org/z/6bdEdKP54

I made the above to see if LLVM would keep the branch. It looks like `foo` and `bar` both get compiled to the same instructions at -O3, to the point where the compiler only keeps `bar`?

I've moved it to use `i32::from(i == j)` since this is less ambiguous for the compiler.
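
For reference, a self-contained illustration of the two equivalent forms (the function names here are made up for the example):

```rust
// Two ways to turn a comparison into an integer contribution. At -O3,
// LLVM compiles both to the same branch-free instructions; `i32::from`
// simply states the intent explicitly.
fn with_branch(i: usize, j: usize) -> i32 {
    if i == j { 1 } else { 0 }
}

fn with_from(i: usize, j: usize) -> i32 {
    i32::from(i == j)
}

fn main() {
    assert_eq!(with_branch(3, 3), with_from(3, 3));
    assert_eq!(with_branch(3, 4), with_from(3, 4));
}
```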


```rust
/// For statistical purposes: this number informs how many bits of the
/// checksum were required to identify pure coded symbols.
pub(crate) required_bits: usize,
```

Collaborator:

Wondering whether we still want/need this...
Essentially, the smaller this number, the more robust the procedure.

```rust
/// The encoded stream of hashes.
/// All recovered coded symbols have been removed from this stream.
/// If decoded failed, then one can simply append more data and continue decoding.
encoded: EncodingSession<T, H>,
```

Collaborator:

We need to add somewhere that all coded symbols must be computed with the same seed.


```rust
/// Create a EncodingSession from a vector of coded symbols.
///
/// Returns an error if the split is out of range or if the length of te vector
```

Collaborator:

Suggested change:

```diff
-/// Returns an error if the split is out of range or if the length of te vector
+/// Returns an error if the split is out of range or if the length of the vector
```

```rust
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
//! by testing whether one of the 5 hash functions would map the candidate value back to the index
//! of the value. Taking advantage of this necessary condition reduces the number of necessary checksum
//! bits by a bit.
```

Collaborator:

"a bit"... i.e. one bit?
This is ambiguous ;)

```rust
//! This algorithm has similar properties to the "Practical Rateless Set Reconciliation" algorithm.
//! Main differences are:
//! * We only use 5 hash functions instead of log(n).
//! * As a result we only require 5 independent hash functions instead of log(n) many.
```

Collaborator:

Actually, I don't recall whether they need log(n) *independent* hash functions...
It might be that independence is not required in their scheme?

```rust
//! and a stable math library must be used!)
//! * Encoding/decoding is faster due to the fixed number of hash functions and the simpler operations.
//! * Since we have a fixed number of hash functions, we can utilize the coded symbol index as
//!   an additional condition. In fact, we need to compute just a single hash function (on average).
```

Collaborator:

Suggested change:

```diff
-//! an additional condition. In fact, we need to compute just a single hash function (on average).
+//! an additional condition. In fact, we need to compute just a single hash function (on average) for this check.
```

```rust
    }
}

fn indices<T>(
```

Collaborator:

should this function be moved into a separate file, since it isn't public?

Collaborator:

maybe some index.rs or bucket_index.rs file?

```rust
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
fn index_for_seed<T>(
```

Collaborator:

same for these...

```rust
/// - Fewer (ideally 3) hash functions with larger partitions lead to a higher chance to
///   find a pure value in the stream, i.e. the stream can be decoded with fewer coded symbols.
///
/// After testing various schemes, I settled for this one which uses 4 equally sized partitions,
```

Collaborator:

"I" => ... (rephrase to avoid the first person in the doc comment)

```rust
/// Note: we want an odd number of hash functions, so that collapsing the stream to a single coded symbol
/// (or small number of coded symbols) won't erase the value information.
///
/// The second return value indicates whether the entry should be stored negated.
```

Collaborator:

remove this last sentence, since there isn't a second return value anymore ;)


```rust
diff.sort();

assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator:

You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?
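
For example, something like this (the `DecodedValue` variant names are a guess from the enum's description in this PR; the real names may differ):

```rust
// Hypothetical: map each decoded entry to a signed value so the README
// example shows the direction of the difference, not just membership.
let signed_diff: Vec<i64> = decoded
    .into_iter()
    .map(|value| match value {
        DecodedValue::Added(x) => x as i64,      // only in the other set
        DecodedValue::Removed(x) => -(x as i64), // only in our set
    })
    .collect();
```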


```rust
assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
```

Collaborator:

I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.
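
Loosely, something of this shape (stitched together from `EncodingSession::new`, `extend`, and `DecodingSession::from_encoding`/`append` quoted elsewhere in this review; the session-merging step, the `first_chunk`/`remaining_chunks` helpers, and the `is_decoded` check are assumptions):

```rust
// Both sides encode the same range of 1024 coded symbols, with the same
// hash functions and (per a comment above) the same seed.
let mut encoder = EncodingSession::new(DefaultHashFunctions, 0..1024);
encoder.extend(local_set.iter().copied());

// After exchanging and combining coded symbols, the receiver decodes the
// first 128 symbols (from_encoding requires the range to start at 0),
// then appends 128 more at a time until decoding succeeds.
let mut session = DecodingSession::from_encoding(first_chunk); // range 0..128
for chunk in remaining_chunks {
    if session.is_decoded() { // status check assumed
        break;
    }
    session.append(chunk); // feed the next 128 coded symbols
}
```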

```toml
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
categories = ["algorithms", "data-structures", "mathematics", "science"]

[dependencies]
```

Collaborator:

we also want some benchmarks to show throughput


If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.

For example if you were trying to find the difference between two sets of documents, instead of each coded symbol being the whole document it could instead just be a SHA1 hash of the document content. Since each SHA1 digest has 20 bytes of high entropy bits, instead of hashing this value five times again to produce the five coded symbol indices we can simply slice out five `u32` values from the digest itself.
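
A minimal sketch of that slicing, assuming a raw 20-byte digest; the function name is made up for illustration, and the wiring into the `HashFunctions` trait is not shown:

```rust
/// Reinterpret a 20-byte SHA1 digest as five u32 index seeds instead of
/// rehashing the value: 20 bytes / 4 bytes per u32 = exactly five values.
fn indices_from_digest(digest: &[u8; 20]) -> [u32; 5] {
    let mut out = [0u32; 5];
    for (i, chunk) in digest.chunks_exact(4).enumerate() {
        // Little-endian is an arbitrary but fixed choice; any stable byte
        // order works as long as both peers agree on it.
        out[i] = u32::from_le_bytes(chunk.try_into().unwrap());
    }
    out
}

fn main() {
    let digest = [0x5au8; 20]; // stand-in for a real SHA1 digest
    assert_eq!(indices_from_digest(&digest), [0x5a5a5a5a; 5]);
}
```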

Collaborator:

I think we should have a feature flag which provides the SHA1 functionality for you...

@@ -0,0 +1,400 @@

```rust
//! This module implements a set reconciliation algorithm using XOR-based hashes.
//!
//! The core algorithm is based on the idea of set similarity sketching where pure hashes are identified
```

Collaborator:

I don't think we ever talk about set similarity sketching anywhere else.
Dropping this term here is therefore confusing IMO.
(I'm also not quite sure what the relation is except that both procedures use some kind of hashing and an array :) )


When using HRIBLT in production systems it is important to consider the stability of your hash functions.

We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.

Collaborator:

Suggested change:

```diff
-We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
+We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produced by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust.
```


`y = len(coded_symbols) / diff_size`

![Coded symbol multiplier](../evaulation/overhead/overhead.png)

Collaborator:

looks like the file path got messed up :)
evaulation


![Coded symbol multiplier](../evaulation/overhead/overhead.png)

For small diffs, the number of coded symbols required per value is larger, after a difference of approximately 100 values the coefficient settles on around 1.3 to 1.4.

Collaborator:

the graph should probably be logarithmic in the diff size to show that the variance goes down with larger n!
I.e. for large diff sizes (>10000) the variance is almost completely gone.

Comment on lines +142 to +177

```rust
// Generate base set of random u64 values
let base_set: HashSet<u64> = (0..set_size).map(|_| rng.random()).collect();

// Create set A as the base set
let set_a: Vec<u64> = base_set.iter().copied().collect();

// Create set B by removing some elements and adding new ones
let mut set_b: HashSet<u64> = base_set.clone();

// Remove diff_size/2 elements from set B
let removals = diff_size / 2;
let additions = diff_size - removals;

let mut to_remove: Vec<u64> = set_b.iter().copied().collect();
to_remove.shuffle(rng);
for val in to_remove.into_iter().take(removals as usize) {
    set_b.remove(&val);
}

// Add diff_size - removals new elements to set B
for _ in 0..additions {
    loop {
        let new_val: u64 = rng.random();
        if !base_set.contains(&new_val) && set_b.insert(new_val) {
            break;
        }
    }
}

let set_b: Vec<u64> = set_b.into_iter().collect();

// Create encoding sessions for both sets with max capacity
let state = DefaultHashFunctions;

let mut encoder_a = EncodingSession::new(state, 0..max_symbols);
encoder_a.extend(set_a.iter().copied());
```

Collaborator:

this whole setup is crazily inefficient especially if you want to compute it for ALL diff sizes within a range.
We should at least change this code to use exponentially growing diff sizes.
On top, one doesn't need TWO sets... it is totally fine to use a single set with n elements.
If you want to be fancy, then you can insert some of the n elements as insertions and the others as deletions.

With this procedure, one can simply compute the next encoding session by adding another (distinct) element to it. The distinct aspect can be achieved by starting off with a random set of unique integers (in random order). At every x-th integer addition/removal you can then try to decode the set.
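
A sketch of that procedure (reusing `EncodingSession::new`, `insert`, and `shuffle` from the code above; `try_decode` is a placeholder for the decode attempt):

```rust
// One session, grown one distinct element at a time; decoding is attempted
// at exponentially spaced checkpoints instead of rebuilding sets per size.
let mut unique: Vec<u64> = (0..max_diff).collect();
unique.shuffle(&mut rng); // random order, guaranteed distinct

let mut session = EncodingSession::new(DefaultHashFunctions, 0..max_symbols);
let mut checkpoint = 1;
for (i, value) in unique.into_iter().enumerate() {
    session.insert(value);
    if i + 1 == checkpoint {
        try_decode(&session); // placeholder: attempt decoding here
        checkpoint *= 2; // exponentially growing diff sizes
    }
}
```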

```rust
}

// Grow by 10%, but at least 1
let growth = (current_symbols / 10).max(1);
```

Collaborator:

It "should" be fine to go here in steps of one.


```rust
if chunk_end > chunk_start {
    let chunk = merged.split_off(chunk_end - chunk_start);
    decoding_session.append(chunk);
```

Collaborator:

You could avoid the whole loop by letting the append function return how many coded symbols were required to decode the session...
(You might still want to keep the loop, since it can avoid copying data unnecessarily. In that case, you probably want to copy at least e.g. 32 elements at a time, or 10%, whichever is larger.)

```rust
let chunk_end = current_symbols.min(max_symbols);

if chunk_end > chunk_start {
    let chunk = merged.split_off(chunk_end - chunk_start);
```

Collaborator:

In the scheme I described above, you don't want to split_off the next range, but rather create a copy of a range... This might also be what VSC wants to use to reuse data across multiple round-trips.

```rust
eprintln!("Results:");
eprintln!("  Successes: {}/{} ({:.1}%)", successes, args.trials,
    100.0 * successes as f64 / args.trials as f64);
eprintln!("  Failures: {}/{} ({:.1}%)", failures, args.trials,
```

Collaborator:

have there been failures?


```rust
for i in 0..args.trials {
    let diff_size = diff_iter.next_diff_size(&mut *rng);
    let max_symbols = (diff_size * args.multiplier) as usize;
```

Collaborator:

you actually want some minimum and a multiplier.
The multiplier should be something like 2 and the minimum should be something like 1000 or so.


### Running the Evaluation

1. Ensure R and ImageMagick are installed with necessary packages:

Collaborator:

Question: Copilot generated a graph for me with the `plotters` crate.
Should we use that instead?

Typical results show:
- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x
- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols

Collaborator:

10x the diff size isn't a great outcome :)
Mention here the min trick I described above (and which we use in production)

Collaborator:

I would actually expect that we need AT MOST 50% overhead after some diff size.

Comment on lines +87 to +98

```rust
/// This is a faster version for decoding the initial stream.
/// It processes this stream from back to front without going through the hierarchical representation.
/// The other procedure needs to execute roughly one additional `is_pure` test when unrolling the hierarchy
/// which this procedure avoids.
/// Additionally, this procedure can save on average another 50% of is_pure tests, since it won't waste time
/// on the highly packed hierarchy levels where we don't expect to find any pure values.
///
/// Panics if the encoding session is not the beginning of a stream (e.g. the range is `0..n`)
pub fn from_encoding(merged: EncodingSession<T, H>) -> Self {
    assert_eq!(merged.range.start, 0);
    Self::from_encoding_unchecked(merged)
}
```

Collaborator:

would be nice to have a speed benchmark + graph...

```rust
if self.non_zero == 0 {
    // At this point everything should be decoded...
    // We could in theory check that all remaining coded symbols are zero.
    break;
```

Collaborator:

Here you could return `i`, which indicates how many symbols were required for decoding.

```rust
}

/// Adds an entity to the encoding session.
pub fn insert(&mut self, entity: T) {
```

Collaborator:

for the benchmark, it would be nice to have a remove function so that you don't have to go through 2 encoding sessions.


```rust
/// This test ensures that the parent and hash_to_index functions are consistent to each other!
#[test]
fn test_parent() {
```

Collaborator:

some of these tests should probably be moved to the files which actually implement that functionality.
