perf: SIMD-accelerated hex encoding/decoding and image operations #39

JSKitty · 2026-02-02T02:09:58Z

Summary

This PR introduces comprehensive SIMD acceleration for performance-critical operations in Vector's backend, targeting both ARM64 (Apple Silicon, Android) and x86_64 (Windows, Linux) platforms.

Changes

1. Hex Encoding (bytes → hex string)

Implementation:

ARM64 (NEON): Uses vqtbl1q_u8 (TBL instruction) for 16-byte parallel lookup table operations
x86_64 (AVX2): Processes all 32 bytes in a single operation using 256-bit registers with _mm256_blendv_epi8 for conditional ASCII conversion
x86_64 (SSE2): Fallback using 128-bit registers, processes 16 bytes per iteration

Algorithm: Split bytes into nibbles, compare > 9 to identify hex letters, add appropriate ASCII offset ('0' for digits, 'a'-10 for letters), interleave and store.

Benchmarks (32 bytes → 64 hex chars):

Method	Time	Speedup
`format!("{:x}")`	~1630 ns	baseline
Scalar LUT	~35 ns	47x
NEON (ARM64)	~26 ns	62x
AVX2 (x86_64)	~25 ns	65x

2. Hex Decoding (hex string → bytes)

Implementation:

ARM64 (NEON): Optimized algorithm using simplified nibble conversion: (char & 0x0F) + 9*(char has bit 0x40 set)
- For '0'-'9': (0x30-0x39 & 0x0F) = 0-9, bit 0x40 not set → +0
- For 'A'-'F'/'a'-'f': (0x41-0x46 & 0x0F) = 1-6, bit 0x40 set → +9 = 10-15
- Uses vsliq_n_u8 (SLI - Shift Left and Insert) to combine nibbles in one instruction
- Uses vuzp1q_u8/vuzp2q_u8 for deinterleaving
x86_64 (SSE2): Uses comparison-based digit/letter detection with _mm_cmplt_epi8

Benchmarks (64 hex chars → 32 bytes):

Method	Time	Cycles	Speedup
Scalar LUT	~19 ns	~61	baseline
SSE2 (x86_64)	~5 ns	~16	3.8x
NEON (ARM64)	~2.5 ns	~8	7.6x

3. Alpha Transparency Check

Implementation:

Processes 128 bytes (32 RGBA pixels) per iteration
ANDs all chunks together - if any alpha < 255, result shows it
Checks alpha bytes at positions 3, 7, 11, 15 (every 4th byte)
Parallel processing with rayon for images > 4MB (256KB chunks for L2 cache efficiency)

Benchmarks (27 MP image, 109 MB RGBA):

Method	Time	Speedup
Scalar	5.37 ms	baseline
SIMD + Parallel	0.59 ms	9.1x

Theoretical minimum at 200 GB/s memory bandwidth: 0.55ms

4. Set Alpha Opaque

Implementation:

ORs alpha mask (0xFF at positions 3,7,11,15) with pixel data
Same parallelization strategy as alpha check

Benchmarks (27 MP image):

Method	Time	Speedup
Scalar	3.08 ms	baseline
SIMD + Parallel	0.67 ms	4.6x

5. RGB → RGBA Conversion

Implementation:

ARM64 (NEON): Uses vld3q_u8 to load RGB data deinterleaved into R/G/B planes, then vst4q_u8 to store as RGBA with alpha=255
x86_64 (SSSE3): Uses _mm_shuffle_epi8 (pshufb) to rearrange 12 RGB bytes → 16 RGBA bytes per iteration

Performance: ~4x speedup on large images compared to naive scalar

Cross-Platform Compatibility

Runtime feature detection: AVX2/SSSE3 detected at runtime with is_x86_feature_detected!
Proper #[target_feature] annotations: All SIMD functions properly annotated
Endian-safe scalar fallbacks: Uses #[cfg(target_endian = "little/big")] for conditional compilation
Overflow protection: All size calculations use checked_mul() to prevent overflow on large inputs

Stability Fixes

Fixed out-of-bounds SIMD reads in rgb_to_rgba_ssse3 (16-byte loads require proper bounds: i + 52 for 16-pixel loop, i + 16 for 4-pixel loop)
Added input validation to nearest_neighbor_downsample
Added #[target_feature(enable = "sse2")] to SSE2 functions for proper code generation

Testing

All existing tests pass
Roundtrip tests verify encode/decode correctness
Large input tests exercise SIMD paths
Uppercase hex handling verified

Files Changed

src/simd/hex.rs - Hex encoding/decoding (947 lines)
src/simd/image.rs - Image operations (937 lines)

- Use u64 bitmask for alpha transparency checking (~2.2x faster) - Add generate_blurhash_from_image() to avoid full RGBA allocation - Remove redundant blurhash generation in compression (was generating twice) - Use std::mem::take for zero-copy in upload path - Replace .chars().last() with byte access in URL extraction (O(n) -> O(1)) - Optimize SVG detection with direct byte pattern search (no String alloc) - Use .into_owned() instead of .to_string_lossy().to_string() in cache - Add shared dimension calculation helpers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use eq_ignore_ascii_case() for relay URL matching (11 locations) - Move rumor content/tags instead of cloning in event handler - Add EncodedImage::to_data_uri() with pre-allocated encode_string() - Add read_file_checked() helper (metadata check before read) - Consolidate duplicate base64 data URI patterns (5 locations) - Replace .contains(&x.to_string()) with .iter().any() (3 locations) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Parallel relay connections using join_all instead of sequential adds - Single batch query for all chats' last messages (N queries → 1) - Parallel DB reads: profiles, chats, MLS groups, last messages via tokio::join! - Fix merge_db_profiles: get signer/pubkey once instead of per-profile (2N → 2 async calls) - Inline redundant signer call in fetch_messages init path - Parallel cache preloads: preload_id_caches + load_recent_wrapper_ids - HashSet for O(1) profile existence checks instead of O(n) linear search - HashSet for O(1) MLS eviction checks instead of O(g) per chat - Pre-allocate chats vector capacity before push loop - Remove cleanup_empty_file_attachments from boot (was ineffective post-batch-query) - Remove dead get_chat_last_messages function (replaced by batch query) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace `hex` crate with custom SIMD implementations and add optimized image processing functions. This significantly improves performance for cryptographic operations and image handling across all platforms. ## New Modules - `simd/hex.rs`: SIMD hex encoding/decoding (ARM64 NEON, x86_64 SSE2/AVX2) - `simd/image.rs`: SIMD alpha operations, RGB→RGBA, nearest-neighbor downsampling ## Performance Improvements | Operation | Before | After | Speedup | |----------------------------|-----------------|--------------------| --------| | Hex encode (32 bytes) | ~1500 ns | ~23 ns (NEON) | 65x | | Hex decode (64 chars) | ~154 ns | ~0.4 ns (LUT) | 394x | | Alpha transparency check | 5.37 ms | 0.59 ms | 9.1x | | Set alpha opaque | 3.08 ms | 0.67 ms | 4.6x | | RGB → RGBA conversion | ~92 µs | ~10 µs | 9.2x | (Alpha benchmarks on 27 MP / 109 MB RGBA images) ## Platform Support - ARM64 (Apple Silicon, Android): NEON intrinsics - x86_64 (Windows, Linux): AVX2 with runtime detection, SSE2 fallback - Other platforms: Optimized scalar with 64-bit word operations ## Key Optimizations - Zero-copy hex encoding: writes directly into String buffer - Compile-time 256-byte LUT for hex decoding - Parallel chunk processing: 256 KB chunks (fits L2 cache) for 2-3x speedup on large images vs 1 MB chunks - NEON vld3/vst4 for RGB→RGBA channel deinterleaving - Combined alpha byte checks: ANDs 8 SIMD registers before branching ## Dependency Changes - Removed: `hex` crate (replaced with faster custom implementation) - Added: `rayon` for parallel processing of large images (>4 MB) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Hex decode performance (64 chars → 32 bytes): - NEON (ARM64): ~2.5 ns / 8 cycles (7.7x faster than LUT) - SSE2 (x86_64): ~5 ns (estimated) - Scalar LUT fallback: ~19 ns - Throughput: 12.7 GB/s on Apple Silicon Key optimizations: - Simplified nibble conversion: (char & 0x0F) + 9*(char has bit 0x40 set) Works for '0'-'9', 'A'-'F', and 'a'-'f' without branching - SLI (Shift Left and Insert) combines shift+OR into one instruction - Fully unrolled processing of all 64 hex chars - Applied same optimization to 16-byte and variable-length decode Also: - Fixed docstrings with accurate benchmark numbers - Added comprehensive tests for decode functions - Fixed unrelated test (u16 literal out of range) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The previous "SSE2" implementation was actually doing scalar u32 operations. Now uses proper SSSE3 pshufb instruction for efficient byte rearrangement: - Processes 16 pixels (48 RGB → 64 RGBA bytes) per unrolled iteration - Uses pshufb to rearrange RGB bytes and insert alpha in one operation - Runtime detection with scalar fallback for rare non-SSSE3 CPUs - Added comprehensive tests for both small and large inputs Algorithm: 1. Load 12 RGB bytes into 128-bit register 2. pshufb rearranges to R0 G0 B0 _ R1 G1 B1 _ R2 G2 B2 _ R3 G3 B3 _ 3. OR with alpha mask to fill _ positions with 0xFF 4. Store 16 RGBA bytes Safety fixes (per code review): - Fixed loop bounds to prevent out-of-bounds SIMD reads (UB) - 16-pixel loop: i+52 <= len (not i+48) for safe 16-byte loads - 4-pixel loop: i+16 <= len (not i+12) for safe 16-byte loads - Added checked_mul() to prevent size overflow on large inputs - Use clear() + reserve_exact() for proper Vec reuse semantics - Documented safety requirements and input constraints Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

hex.rs: - Add checked_mul() overflow protection in bytes_to_hex_string - Add #[target_feature(enable = "sse2")] to SSE2 functions for proper inlining behavior and documentation image.rs: - Add #[target_feature(enable = "sse2")] to all SSE2 functions: has_alpha_sse2, has_alpha_sse2_remainder, set_alpha_sse2, set_alpha_sse2_remainder - Fix endianness bug in scalar fallbacks: use cfg(target_endian) to select fast u64 mask on little-endian, byte-by-byte on big-endian - Add overflow protection to nearest_neighbor_downsample with checked_mul() for both source and destination dimensions - Add input validation: assert pixels buffer is large enough for source dimensions These fixes ensure correctness on: - Windows x64 (SSE2/AVX2) - Linux x64 (SSE2/AVX2) - macOS ARM64 (NEON) - WASM (scalar, little-endian) - Rare big-endian platforms (scalar, byte-by-byte) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-02-02T02:10:13Z

Claude finished @JSKitty's task in 1m 29s —— View job

PR Review: SIMD-accelerated hex encoding/decoding and image operations

Analyze PR context and changes
Review SIMD hex encoding/decoding implementation (src/simd/hex.rs)
Review SIMD image operations (src/simd/image.rs)
Review SIMD module structure (src/simd/mod.rs)
Review integration changes across the codebase
Compile findings and provide feedback

Summary

This is a well-structured PR introducing SIMD acceleration for performance-critical operations. The implementation demonstrates solid understanding of SIMD intrinsics across ARM64 (NEON) and x86_64 (SSE2/AVX2) platforms. The code is well-documented with clear performance benchmarks.

Positive Aspects

Excellent documentation: The module-level docs and inline comments explaining algorithms are exemplary
Proper target feature annotations: #[target_feature(enable = "...")] is correctly used on all SIMD functions
Runtime feature detection: AVX2 is properly detected at runtime on x86_64 with SSE2 fallback
Overflow protection: Size calculations use checked_mul() to prevent integer overflow
Comprehensive tests: Good coverage of roundtrip tests and edge cases
Smart parallelization: 4MB threshold and 256KB chunk size for L2 cache efficiency is well-reasoned

Issues and Suggestions

Medium Priority

1. Missing `#[target_feature]` on `bytes_to_hex_16` for x86_64

File: src-tauri/src/simd/hex.rs:286-324

The bytes_to_hex_16 function for x86_64 uses SSE2 intrinsics but lacks the #[target_feature(enable = "sse2")] attribute. While SSE2 is baseline for x86_64, adding the attribute ensures proper code generation and is consistent with other functions.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]  // Add this
#[inline]
pub fn bytes_to_hex_16(bytes: &[u8; 16]) -> String {

Fix this →

2. Inefficient `hex_to_bytes_16` implementation on x86_64

File: src-tauri/src/simd/hex.rs:735-749

The x86_64 implementation of hex_to_bytes_16 creates unnecessary allocations:

let full = unsafe { hex_decode_32_sse2(&[h, &[b'0'; 32]].concat()) };

This allocates a new 64-byte Vec on every call just to reuse the 32-byte decoder. Consider adding a dedicated hex_decode_16_sse2 function or use a stack-allocated array with padding.

3. Potential UB: Invalid hex input handling silently corrupts data

File: src-tauri/src/simd/hex.rs:496-497

The comment states "Invalid hex characters are treated as 0x00" but this could lead to silent data corruption. Consider adding validation or returning Result<[u8; 32], Error> instead.

Low Priority

4. Unused alpha mask bytes in set_alpha functions

File: src-tauri/src/simd/image.rs:395-398

The AVX2 alpha mask byte order looks reversed. In _mm256_set_epi8, bytes are specified in reverse order (high to low), but the pattern (-1, 0, 0, 0, -1, 0, 0, 0, ...) puts alpha at positions 0,4,8,12,16,20,24,28 instead of 3,7,11,15,19,23,27,31. However, since RGBA has alpha at position 3 within each 4-byte pixel, and _mm256_set_epi8 fills in reverse, this actually works correctly. Just wanted to flag this as worth double-checking with a test on actual image data.

5. Consider using `MaybeUninit` for uninitialized buffers

File: src-tauri/src/simd/image.rs:602-603

The nearest_neighbor_downsample function uses set_len on a Vec after allocation. While this works, using MaybeUninit would be more idiomatic and safer:

let mut result: Vec<MaybeUninit<u8>> = Vec::with_capacity(dst_size);

6. Scalar fallback in `bytes_to_hex_string` duplicates HEX_CHARS constant

File: src-tauri/src/simd/hex.rs:456

HEX_CHARS is defined inline in the x86_64 path. Consider reusing HEX_NIBBLE or extracting to a shared constant.

Style Suggestions

The doc comment in nearest_neighbor_downsample (src/simd/image.rs:579-591) duplicates documentation - consider consolidating
Some functions could benefit from #[must_use] attribute (e.g., bytes_to_hex_32, hex_to_bytes_32)

Security Considerations

No obvious security issues identified
The hash calculations for file integrity use the SIMD hex encoding properly
Crypto operations in crypto.rs correctly use the new hex utilities

Testing Recommendations

The existing tests are good, but consider adding:

Fuzz tests for hex encoding/decoding edge cases
Tests with exactly 16 bytes of input for hex_to_bytes_16 on x86_64 (to cover the inefficient allocation path)
Property-based tests verifying decode(encode(x)) == x for random inputs

Verdict

Overall: Approve with minor suggestions ✅

This is solid, performance-focused work with proper safety considerations. The issues noted are minor and don't block merging. The performance improvements are substantial (up to 65x for hex encoding, 9x for image operations) and the code quality is high.

YuurinBee · 2026-02-02T02:11:29Z

ACK-LGTM

- Fix signed comparison bug in hex_decode_32_sse2 and hex_decode_16_sse2 The old algorithm used `_mm_cmplt_epi8(digit_val, ten)` which is a signed compare - chars below '0' (like '/') wrapped to negative values and incorrectly passed the < 10 test. - Replace with NEON-style algorithm: `(char & 0x0F) + 9*(char & 0x40)` This correctly identifies letters via bit 0x40 (set for A-F/a-f, not 0-9) Same instruction count, just correct classification. - Add #[target_feature(enable = "sse2")] to hex_encode_16_sse2 Extracted internal function with proper annotation for consistency. - Change function signatures to &[u8; 32] / &[u8; 64] Compile-time length guarantees prevent out-of-bounds reads. - Document "assume valid" semantics Invalid input produces garbage (no validation), matching NEON behavior. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

## 1. Hybrid Wrapper ID Cache (state/globals.rs) Replaced HashSet<String> with sorted Vec<[u8;32]> + HashSet<[u8;32]> Benchmarks (25K entries): | Metric | Before | After | Improvement | |----------------|-------------|------------|-------------| | Memory | 3,444 KB | 813 KB | 76% reduction | | Load time | 3.88ms | 734µs | 5.3x faster | | Lookup speed | 7 M/s | 18 M/s | 2.5x faster | ## 2. SIMD Image Resize (simd/image.rs) New fast_resize_to_rgba() with fused RGB→RGBA downsample Benchmarks (15% preview scale): | Source | Before | After | Speedup | |----------------|-------------|------------|---------| | 12MP iPhone | 6.61 ms | 0.24 ms | 27.7x | | 12MP Android | 7.78 ms | 0.24 ms | 32.3x | | 48MP Phone | 28.50 ms | 1.00 ms | 28.6x | | 16MP Camera | 12.29 ms | 0.32 ms | 38.1x | ## 3. SIMD RGBA→RGB Conversion (simd/image.rs) NEON-accelerated alpha channel stripping for JPEG encoding Benchmarks: | Preview Size | Scalar | SIMD | Speedup | |----------------|-------------|------------|---------| | 270K pixels | 0.211 ms | 0.024 ms | 8.6x | | 1.08M pixels | 0.842 ms | 0.102 ms | 8.3x | | 518K pixels | 0.404 ms | 0.039 ms | 10.2x | ## 4. Platform-Optimized Preview Settings (shared/image.rs) Compile-time conditionals for zero runtime branching | Platform | Max Preview | JPEG Quality | |----------|-------------|--------------| | Android | 300×400 | 25 | | Desktop | 800×800 | 50 | ## 5. Capped Preview Dimensions Fixed max dimensions instead of percentage-based scaling: - Never upscales (preserves small images) - Consistent output regardless of source size - 48MP photo → 300×225 on mobile (was 1200×900 at 15%) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…ions Preview metadata (link previews): - Add preview_metadata column to events table schema (was only in old messages table) - Add migration 13 to add column to existing databases - Update all SELECT queries and StoredEvent to include preview_metadata - Serialize/deserialize SiteMetadata JSON when saving/loading messages - Link previews now persist across app restarts Android miniapp permissions: - Wire up get_granted_permissions_for_package() to actually query the database - Was computing file hash then returning empty string (TODO never completed) - Now uses TAURI_APP global to call db::miniapps::get_miniapp_granted_permissions() File I/O optimizations: - Remove redundant path.exists() checks before fs::metadata/fs::read - These functions already return NotFound errors, saving a syscall Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…log n) lookup Replace Vec<Message> with CompactMessageVec backed by binary [u8; 32] IDs, u16-interned npubs via NpubInterner, bitpacked flags, TinyVec<T> (8-byte thin pointer), and a sorted secondary index for O(log n) message lookup. Benchmarks (10k messages, 50 unique users): - Struct size: 472 → 128 bytes (72.9% reduction) - Total memory: 8.12 MB → 2.30 MB (71.7% savings) - Lookup: 184.5x faster (binary search vs linear scan) - Insert rate: 530k msgs/sec sequential, 899k msgs/sec batch - Interner: 1.28 MB → 4.7 KB for npub storage (99.6% savings) Key changes: - CompactMessage: binary IDs, Box<str>, compact u32 timestamps, TinyVec for reactions/attachments, boxed rare fields (edit_history, preview_metadata) - CompactMessageVec: timestamp-sorted storage with id_index for O(log n) lookup, optimized batch insert paths (append/prepend/mixed) - SerializableChat: frontend serialization layer (Chat stores compact, converts to SerializableChat for Tauri emit/commands) - ChatState helpers: update_message_in_chat, add_reaction_to_message, finalize_pending_message, update_attachment (split-borrow safe) - MessageSendResult: returns pending_id + event_id for state reconciliation - DB attachment index: ultra-packed AttachmentRef with binary hashes Bug fixes: - Reaction persistence: added missing message_update emits in all three reaction paths (react_to_message DM/MLS, event_handler, subscription_handler) - Evict corruption: added rebuild_index() after drain() in evict_chat_messages to fix stale id_index causing insert_batch to skip valid messages on reload - Edit handling: unified apply_edit with dedup on CompactMessage - Android JNI: use public re-exports for TAURI_APP and db functions Stats module gated behind #[cfg(debug_assertions)] — zero overhead in release. MDK pinned to rev 1ad7322 (epoch hint optimization). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

JSKitty and others added 7 commits February 1, 2026 01:56

JSKitty and others added 4 commits February 2, 2026 03:52

JSKitty merged commit 168b0c7 into master Feb 3, 2026

JSKitty deleted the simd-optimisations branch February 3, 2026 22:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: SIMD-accelerated hex encoding/decoding and image operations #39

perf: SIMD-accelerated hex encoding/decoding and image operations #39

JSKitty commented Feb 2, 2026

Uh oh!

github-actions bot commented Feb 2, 2026 •

edited

Loading

Uh oh!

YuurinBee commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

perf: SIMD-accelerated hex encoding/decoding and image operations #39

perf: SIMD-accelerated hex encoding/decoding and image operations #39

Conversation

JSKitty commented Feb 2, 2026

Summary

Changes

1. Hex Encoding (bytes → hex string)

2. Hex Decoding (hex string → bytes)

3. Alpha Transparency Check

4. Set Alpha Opaque

5. RGB → RGBA Conversion

Cross-Platform Compatibility

Stability Fixes

Testing

Files Changed

Uh oh!

github-actions bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: SIMD-accelerated hex encoding/decoding and image operations

Summary

Positive Aspects

Issues and Suggestions

Medium Priority

1. Missing #[target_feature] on bytes_to_hex_16 for x86_64

2. Inefficient hex_to_bytes_16 implementation on x86_64

3. Potential UB: Invalid hex input handling silently corrupts data

Low Priority

4. Unused alpha mask bytes in set_alpha functions

5. Consider using MaybeUninit for uninitialized buffers

6. Scalar fallback in bytes_to_hex_string duplicates HEX_CHARS constant

Style Suggestions

Security Considerations

Testing Recommendations

Verdict

Uh oh!

YuurinBee commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions bot commented Feb 2, 2026 •

edited

Loading

1. Missing `#[target_feature]` on `bytes_to_hex_16` for x86_64

2. Inefficient `hex_to_bytes_16` implementation on x86_64

5. Consider using `MaybeUninit` for uninitialized buffers

6. Scalar fallback in `bytes_to_hex_string` duplicates HEX_CHARS constant