perf: SIMD-accelerated hex encoding/decoding and image operations #39
Conversation
- Use u64 bitmask for alpha transparency checking (~2.2x faster; sketched below)
- Add generate_blurhash_from_image() to avoid full RGBA allocation
- Remove redundant blurhash generation in compression (it was being generated twice)
- Use std::mem::take for zero-copy in upload path
- Replace .chars().last() with byte access in URL extraction (O(n) → O(1))
- Optimize SVG detection with direct byte pattern search (no String allocation)
- Use .into_owned() instead of .to_string_lossy().to_string() in cache
- Add shared dimension calculation helpers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
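A minimal sketch of the u64 bitmask idea from the first bullet, assuming little-endian byte order and a buffer whose length is a multiple of 4. The helper name is hypothetical, not the actual function in the codebase:

```rust
/// Check whether any pixel in an RGBA buffer is non-opaque, reading two
/// pixels (8 bytes) per iteration via a u64 mask instead of one alpha byte
/// at a time. Illustrative only.
fn has_transparency_u64(rgba: &[u8]) -> bool {
    // Alpha is every 4th byte; in a little-endian u64 holding two RGBA
    // pixels, the alpha bytes sit at bits 24..32 and 56..64.
    const ALPHA_MASK: u64 = 0xFF00_0000_FF00_0000;
    let mut chunks = rgba.chunks_exact(8);
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // Fully opaque pixels leave all masked alpha bits set.
        if word & ALPHA_MASK != ALPHA_MASK {
            return true;
        }
    }
    // Handle a trailing 4-byte pixel, if any.
    chunks.remainder().chunks_exact(4).any(|px| px[3] != 0xFF)
}
```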
- Use eq_ignore_ascii_case() for relay URL matching (11 locations) - Move rumor content/tags instead of cloning in event handler - Add EncodedImage::to_data_uri() with pre-allocated encode_string() - Add read_file_checked() helper (metadata check before read) - Consolidate duplicate base64 data URI patterns (5 locations) - Replace .contains(&x.to_string()) with .iter().any() (3 locations) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
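Two of these patterns as a minimal before/after sketch (variable and function names are illustrative):

```rust
// Case-insensitive relay URL comparison without allocating lowercase copies.
fn same_relay(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

// Membership test without allocating a String per check:
// before: urls.contains(&candidate.to_string())
// after:  compare &str directly.
fn contains_url(urls: &[String], candidate: &str) -> bool {
    urls.iter().any(|u| u.as_str() == candidate)
}
```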
- Parallel relay connections using join_all instead of sequential adds (see the sketch below)
- Single batch query for all chats' last messages (N queries → 1)
- Parallel DB reads: profiles, chats, MLS groups, last messages via tokio::join!
- Fix merge_db_profiles: get signer/pubkey once instead of per-profile (2N → 2 async calls)
- Inline redundant signer call in fetch_messages init path
- Parallel cache preloads: preload_id_caches + load_recent_wrapper_ids
- HashSet for O(1) profile existence checks instead of O(n) linear search
- HashSet for O(1) MLS eviction checks instead of O(g) per chat
- Pre-allocate chats vector capacity before push loop
- Remove cleanup_empty_file_attachments from boot (was ineffective post-batch-query)
- Remove dead get_chat_last_messages function (replaced by batch query)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
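A hedged sketch of the parallel-load shape described above, assuming the tokio runtime and the futures crate's join_all already used by the project. The loader functions and relay URLs here are placeholders, not the real APIs:

```rust
use futures::future::join_all;

// Placeholder types and async loaders standing in for the real DB calls.
struct Profile;
struct Chat;

async fn load_profiles() -> Vec<Profile> { Vec::new() }
async fn load_chats() -> Vec<Chat> { Vec::new() }
async fn add_relay(_url: &str) -> bool { true }

async fn boot() {
    // Independent DB reads run concurrently; tokio::join! awaits them all.
    let (profiles, chats) = tokio::join!(load_profiles(), load_chats());

    // Relay connections kick off in parallel instead of one await per URL.
    let relay_urls = ["wss://relay.one", "wss://relay.two"];
    let results = join_all(relay_urls.into_iter().map(|url| add_relay(url))).await;

    let _ = (profiles, chats, results);
}
```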
Replace `hex` crate with custom SIMD implementations and add optimized image processing functions. This significantly improves performance for cryptographic operations and image handling across all platforms.

## New Modules
- `simd/hex.rs`: SIMD hex encoding/decoding (ARM64 NEON, x86_64 SSE2/AVX2)
- `simd/image.rs`: SIMD alpha operations, RGB→RGBA, nearest-neighbor downsampling

## Performance Improvements

| Operation | Before | After | Speedup |
|---|---|---|---|
| Hex encode (32 bytes) | ~1500 ns | ~23 ns (NEON) | 65x |
| Hex decode (64 chars) | ~154 ns | ~0.4 ns (LUT) | 394x |
| Alpha transparency check | 5.37 ms | 0.59 ms | 9.1x |
| Set alpha opaque | 3.08 ms | 0.67 ms | 4.6x |
| RGB → RGBA conversion | ~92 µs | ~10 µs | 9.2x |

(Alpha benchmarks on 27 MP / 109 MB RGBA images)

## Platform Support
- ARM64 (Apple Silicon, Android): NEON intrinsics
- x86_64 (Windows, Linux): AVX2 with runtime detection, SSE2 fallback
- Other platforms: optimized scalar with 64-bit word operations

## Key Optimizations
- Zero-copy hex encoding: writes directly into the String buffer
- Compile-time 256-byte LUT for hex decoding (sketched below)
- Parallel chunk processing: 256 KB chunks (fits L2 cache) for a 2-3x speedup on large images vs 1 MB chunks
- NEON vld3/vst4 for RGB→RGBA channel deinterleaving
- Combined alpha byte checks: ANDs 8 SIMD registers before branching

## Dependency Changes
- Removed: `hex` crate (replaced with faster custom implementation)
- Added: `rayon` for parallel processing of large images (>4 MB)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
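A minimal sketch of the compile-time 256-byte LUT idea from the Key Optimizations list. The names are illustrative; the real module lives in `simd/hex.rs`:

```rust
// Build a 256-entry decode table at compile time; 0xFF marks invalid chars.
const HEX_LUT: [u8; 256] = {
    let mut lut = [0xFFu8; 256];
    let mut i = 0;
    while i < 256 {
        let c = i as u8;
        lut[i] = match c {
            b'0'..=b'9' => c - b'0',
            b'a'..=b'f' => c - b'a' + 10,
            b'A'..=b'F' => c - b'A' + 10,
            _ => 0xFF,
        };
        i += 1;
    }
    lut
};

// Decode a pair of hex chars into one byte via two table lookups.
fn decode_byte(hi: u8, lo: u8) -> Option<u8> {
    let (h, l) = (HEX_LUT[hi as usize], HEX_LUT[lo as usize]);
    if h == 0xFF || l == 0xFF { None } else { Some((h << 4) | l) }
}
```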
Hex decode performance (64 chars → 32 bytes):
- NEON (ARM64): ~2.5 ns / 8 cycles (7.7x faster than LUT)
- SSE2 (x86_64): ~5 ns (estimated)
- Scalar LUT fallback: ~19 ns
- Throughput: 12.7 GB/s on Apple Silicon

Key optimizations:
- Simplified nibble conversion: (char & 0x0F) + 9*(char has bit 0x40 set). Works for '0'-'9', 'A'-'F', and 'a'-'f' without branching (see the sketch below)
- SLI (Shift Left and Insert) combines shift+OR into one instruction
- Fully unrolled processing of all 64 hex chars
- Applied the same optimization to 16-byte and variable-length decode

Also:
- Fixed docstrings with accurate benchmark numbers
- Added comprehensive tests for decode functions
- Fixed unrelated test (u16 literal out of range)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
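The branchless nibble conversion, written out as scalar Rust to make the bit trick concrete. This is illustrative; the real code applies it lane-wise in SIMD registers and assumes already-validated hex input:

```rust
/// Branchless hex-nibble decode for an assumed-valid hex character:
/// '0'-'9' (0x30-0x39): low nibble is already 0-9, bit 0x40 is clear -> +0.
/// 'A'-'F' (0x41-0x46) and 'a'-'f' (0x61-0x66): low nibble is 1-6,
/// bit 0x40 is set -> +9, giving 10-15.
fn nibble(c: u8) -> u8 {
    (c & 0x0F) + 9 * ((c >> 6) & 1)
}

fn main() {
    for (c, expected) in [(b'0', 0), (b'9', 9), (b'A', 10), (b'f', 15)] {
        assert_eq!(nibble(c), expected);
    }
}
```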
The previous "SSE2" implementation was actually doing scalar u32 operations. Now uses proper SSSE3 pshufb instruction for efficient byte rearrangement: - Processes 16 pixels (48 RGB → 64 RGBA bytes) per unrolled iteration - Uses pshufb to rearrange RGB bytes and insert alpha in one operation - Runtime detection with scalar fallback for rare non-SSSE3 CPUs - Added comprehensive tests for both small and large inputs Algorithm: 1. Load 12 RGB bytes into 128-bit register 2. pshufb rearranges to R0 G0 B0 _ R1 G1 B1 _ R2 G2 B2 _ R3 G3 B3 _ 3. OR with alpha mask to fill _ positions with 0xFF 4. Store 16 RGBA bytes Safety fixes (per code review): - Fixed loop bounds to prevent out-of-bounds SIMD reads (UB) - 16-pixel loop: i+52 <= len (not i+48) for safe 16-byte loads - 4-pixel loop: i+16 <= len (not i+12) for safe 16-byte loads - Added checked_mul() to prevent size overflow on large inputs - Use clear() + reserve_exact() for proper Vec reuse semantics - Documented safety requirements and input constraints Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
hex.rs:
- Add checked_mul() overflow protection in bytes_to_hex_string
- Add #[target_feature(enable = "sse2")] to SSE2 functions for proper inlining behavior and documentation

image.rs:
- Add #[target_feature(enable = "sse2")] to all SSE2 functions: has_alpha_sse2, has_alpha_sse2_remainder, set_alpha_sse2, set_alpha_sse2_remainder
- Fix endianness bug in scalar fallbacks: use cfg(target_endian) to select the fast u64 mask on little-endian, byte-by-byte on big-endian (see the sketch below)
- Add overflow protection to nearest_neighbor_downsample with checked_mul() for both source and destination dimensions
- Add input validation: assert the pixels buffer is large enough for the source dimensions

These fixes ensure correctness on:
- Windows x64 (SSE2/AVX2)
- Linux x64 (SSE2/AVX2)
- macOS ARM64 (NEON)
- WASM (scalar, little-endian)
- Rare big-endian platforms (scalar, byte-by-byte)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
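A sketch of the endianness-gated fallback pattern described above. The helper is hypothetical; the real functions live in `simd/image.rs`:

```rust
/// Force alpha to 0xFF across an RGBA buffer using a scalar fallback path.
fn set_alpha_opaque_scalar(rgba: &mut [u8]) {
    #[cfg(target_endian = "little")]
    {
        // Little-endian: two pixels per u64, alpha bytes at bits 24..32 and 56..64.
        const ALPHA: u64 = 0xFF00_0000_FF00_0000;
        let mut chunks = rgba.chunks_exact_mut(8);
        for chunk in &mut chunks {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(chunk);
            let word = u64::from_le_bytes(bytes) | ALPHA;
            chunk.copy_from_slice(&word.to_le_bytes());
        }
        for px in chunks.into_remainder().chunks_exact_mut(4) {
            px[3] = 0xFF;
        }
    }
    #[cfg(target_endian = "big")]
    {
        // Big-endian: the u64 mask positions would differ, so stay byte-by-byte.
        for px in rgba.chunks_exact_mut(4) {
            px[3] = 0xFF;
        }
    }
}
```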
Claude finished @JSKitty's task in 1m 29s

PR Review: SIMD-accelerated hex encoding/decoding and image operations

Summary

This is a well-structured PR introducing SIMD acceleration for performance-critical operations. The implementation demonstrates a solid understanding of SIMD intrinsics across ARM64 (NEON) and x86_64 (SSE2/AVX2) platforms, and the code is well documented with clear performance benchmarks.

Positive Aspects

Issues and Suggestions

Medium Priority

1. Missing
ACK-LGTM
- Fix signed comparison bug in hex_decode_32_sse2 and hex_decode_16_sse2. The old algorithm used `_mm_cmplt_epi8(digit_val, ten)`, which is a signed compare: chars below '0' (like '/') wrapped to negative values and incorrectly passed the < 10 test.
- Replace with the NEON-style algorithm: `(char & 0x0F) + 9*(char & 0x40)`. This correctly identifies letters via bit 0x40 (set for A-F/a-f, not 0-9). Same instruction count, just correct classification (see the sketch below).
- Add #[target_feature(enable = "sse2")] to hex_encode_16_sse2: extracted internal function with proper annotation for consistency.
- Change function signatures to &[u8; 32] / &[u8; 64]: compile-time length guarantees prevent out-of-bounds reads.
- Document "assume valid" semantics: invalid input produces garbage (no validation), matching NEON behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
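The corrected classification, sketched as a 16-lane SSE2 fragment. This is illustrative (the PR's functions work on fixed-size 32- and 64-char inputs): instead of a signed "< 10" compare, test bit 0x40 per byte and add 9 where it is set.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn hex_nibbles_16(chars: &[u8; 16]) -> [u8; 16] {
    use std::arch::x86_64::*;

    let v = _mm_loadu_si128(chars.as_ptr() as *const __m128i);
    // Low nibble of every byte: 0-9 for digits, 1-6 for letters.
    let low = _mm_and_si128(v, _mm_set1_epi8(0x0F));
    // Bit 0x40 is set for 'A'-'F' / 'a'-'f' but not for '0'-'9'.
    let bit40 = _mm_set1_epi8(0x40);
    let is_letter = _mm_cmpeq_epi8(_mm_and_si128(v, bit40), bit40); // 0xFF or 0x00
    // Add 9 only in letter lanes (valid hex input assumed, as in the PR).
    let adjust = _mm_and_si128(is_letter, _mm_set1_epi8(9));
    let nibbles = _mm_add_epi8(low, adjust);

    let mut out = [0u8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, nibbles);
    out
}
```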
## 1. Hybrid Wrapper ID Cache (state/globals.rs)

Replaced HashSet<String> with sorted Vec<[u8; 32]> + HashSet<[u8; 32]> (see the sketch below).

Benchmarks (25K entries):

| Metric | Before | After | Improvement |
|---|---|---|---|
| Memory | 3,444 KB | 813 KB | 76% reduction |
| Load time | 3.88 ms | 734 µs | 5.3x faster |
| Lookup speed | 7 M/s | 18 M/s | 2.5x faster |

## 2. SIMD Image Resize (simd/image.rs)

New fast_resize_to_rgba() with fused RGB→RGBA downsample.

Benchmarks (15% preview scale):

| Source | Before | After | Speedup |
|---|---|---|---|
| 12MP iPhone | 6.61 ms | 0.24 ms | 27.7x |
| 12MP Android | 7.78 ms | 0.24 ms | 32.3x |
| 48MP Phone | 28.50 ms | 1.00 ms | 28.6x |
| 16MP Camera | 12.29 ms | 0.32 ms | 38.1x |

## 3. SIMD RGBA→RGB Conversion (simd/image.rs)

NEON-accelerated alpha channel stripping for JPEG encoding.

Benchmarks:

| Preview Size | Scalar | SIMD | Speedup |
|---|---|---|---|
| 270K pixels | 0.211 ms | 0.024 ms | 8.6x |
| 1.08M pixels | 0.842 ms | 0.102 ms | 8.3x |
| 518K pixels | 0.404 ms | 0.039 ms | 10.2x |

## 4. Platform-Optimized Preview Settings (shared/image.rs)

Compile-time conditionals for zero runtime branching.

| Platform | Max Preview | JPEG Quality |
|---|---|---|
| Android | 300×400 | 25 |
| Desktop | 800×800 | 50 |

## 5. Capped Preview Dimensions

Fixed max dimensions instead of percentage-based scaling:
- Never upscales (preserves small images)
- Consistent output regardless of source size
- 48MP photo → 300×225 on mobile (was 1200×900 at 15%)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
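One plausible reading of the hybrid cache layout, as a hedged sketch: a sorted, deduplicated Vec holds the bulk of 32-byte IDs loaded at startup (binary search, no per-entry hashing), and a small HashSet absorbs IDs added at runtime. Field and method names here are illustrative, not the real `state/globals.rs` API:

```rust
use std::collections::HashSet;

struct WrapperIdCache {
    sorted_bulk: Vec<[u8; 32]>,
    recent: HashSet<[u8; 32]>,
}

impl WrapperIdCache {
    /// Build from IDs loaded at startup: sort + dedup once, O(n log n).
    fn from_loaded(mut ids: Vec<[u8; 32]>) -> Self {
        ids.sort_unstable();
        ids.dedup();
        Self { sorted_bulk: ids, recent: HashSet::new() }
    }

    /// O(log n) over the bulk, O(1) over the recent set.
    fn contains(&self, id: &[u8; 32]) -> bool {
        self.sorted_bulk.binary_search(id).is_ok() || self.recent.contains(id)
    }

    /// New IDs seen at runtime go to the HashSet; no re-sorting needed.
    fn insert(&mut self, id: [u8; 32]) -> bool {
        if self.sorted_bulk.binary_search(&id).is_ok() {
            return false;
        }
        self.recent.insert(id)
    }
}
```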
…ions

Preview metadata (link previews):
- Add preview_metadata column to the events table schema (was only in the old messages table)
- Add migration 13 to add the column to existing databases
- Update all SELECT queries and StoredEvent to include preview_metadata
- Serialize/deserialize SiteMetadata JSON when saving/loading messages
- Link previews now persist across app restarts

Android miniapp permissions:
- Wire up get_granted_permissions_for_package() to actually query the database
- It was computing the file hash and then returning an empty string (TODO never completed)
- Now uses the TAURI_APP global to call db::miniapps::get_miniapp_granted_permissions()

File I/O optimizations:
- Remove redundant path.exists() checks before fs::metadata/fs::read
- These functions already return NotFound errors, so this saves a syscall

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…log n) lookup

Replace Vec<Message> with CompactMessageVec backed by binary [u8; 32] IDs, u16-interned npubs via NpubInterner, bitpacked flags, TinyVec<T> (8-byte thin pointer), and a sorted secondary index for O(log n) message lookup (see the sketch below).

Benchmarks (10k messages, 50 unique users):
- Struct size: 472 → 128 bytes (72.9% reduction)
- Total memory: 8.12 MB → 2.30 MB (71.7% savings)
- Lookup: 184.5x faster (binary search vs linear scan)
- Insert rate: 530k msgs/sec sequential, 899k msgs/sec batch
- Interner: 1.28 MB → 4.7 KB for npub storage (99.6% savings)

Key changes:
- CompactMessage: binary IDs, Box<str>, compact u32 timestamps, TinyVec for reactions/attachments, boxed rare fields (edit_history, preview_metadata)
- CompactMessageVec: timestamp-sorted storage with id_index for O(log n) lookup, optimized batch insert paths (append/prepend/mixed)
- SerializableChat: frontend serialization layer (Chat stores compact, converts to SerializableChat for Tauri emit/commands)
- ChatState helpers: update_message_in_chat, add_reaction_to_message, finalize_pending_message, update_attachment (split-borrow safe)
- MessageSendResult: returns pending_id + event_id for state reconciliation
- DB attachment index: ultra-packed AttachmentRef with binary hashes

Bug fixes:
- Reaction persistence: added missing message_update emits in all three reaction paths (react_to_message DM/MLS, event_handler, subscription_handler)
- Evict corruption: added rebuild_index() after drain() in evict_chat_messages to fix a stale id_index causing insert_batch to skip valid messages on reload
- Edit handling: unified apply_edit with dedup on CompactMessage
- Android JNI: use public re-exports for TAURI_APP and db functions

Stats module gated behind #[cfg(debug_assertions)] (zero overhead in release).

MDK pinned to rev 1ad7322 (epoch hint optimization).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
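A reduced sketch of the O(log n) lookup idea: a timestamp-sorted message store plus a secondary index sorted by binary ID. The structs and fields are simplified placeholders, not the actual CompactMessage layout:

```rust
struct CompactMessage {
    id: [u8; 32],
    timestamp: u32,
    content: Box<str>,
}

struct CompactMessageVec {
    /// Messages kept sorted by timestamp (append-mostly).
    messages: Vec<CompactMessage>,
    /// Secondary index: (id, position in `messages`), sorted by id,
    /// so lookups are a binary search instead of a linear scan.
    id_index: Vec<([u8; 32], usize)>,
}

impl CompactMessageVec {
    fn get_by_id(&self, id: &[u8; 32]) -> Option<&CompactMessage> {
        let slot = self
            .id_index
            .binary_search_by(|(key, _)| key.cmp(id))
            .ok()?;
        self.messages.get(self.id_index[slot].1)
    }

    /// After bulk mutations (e.g. evicting a chat), rebuild the index so
    /// stale positions cannot point at the wrong message.
    fn rebuild_index(&mut self) {
        self.id_index = self
            .messages
            .iter()
            .enumerate()
            .map(|(i, m)| (m.id, i))
            .collect();
        self.id_index.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    }
}
```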
Summary
This PR introduces comprehensive SIMD acceleration for performance-critical operations in Vector's backend, targeting both ARM64 (Apple Silicon, Android) and x86_64 (Windows, Linux) platforms.
Changes
1. Hex Encoding (bytes → hex string)
Implementation:
- NEON: `vqtbl1q_u8` (TBL instruction) for 16-byte parallel lookup table operations
- AVX2: `_mm256_blendv_epi8` for conditional ASCII conversion

Algorithm: split bytes into nibbles, compare > 9 to identify hex letters, add the appropriate ASCII offset ('0' for digits, 'a' − 10 for letters), then interleave and store (see the sketch below).
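The compare-and-offset step, written as scalar Rust for a single byte to make the algorithm concrete. Illustrative only; the SIMD paths do this for 16+ bytes per instruction:

```rust
/// Encode one byte as two lowercase hex chars: digits get '0' added,
/// values 10-15 get ('a' - 10) added.
fn encode_byte(b: u8) -> [u8; 2] {
    let hi = b >> 4;
    let lo = b & 0x0F;
    let to_ascii = |n: u8| if n > 9 { n + (b'a' - 10) } else { n + b'0' };
    [to_ascii(hi), to_ascii(lo)]
}

fn main() {
    assert_eq!(&encode_byte(0x3F), b"3f");
    assert_eq!(&encode_byte(0xA0), b"a0");
}
```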
Benchmarks (32 bytes → 64 hex chars): ~1500 ns for the previous `format!("{:x}")`-style per-byte formatting vs ~23 ns with NEON (~65x; see the table in the commit message above).

2. Hex Decoding (hex string → bytes)
Implementation:
- Branchless nibble conversion: `(char & 0x0F) + 9*(char has bit 0x40 set)`
  - '0'-'9' (0x30-0x39): `& 0x0F` gives 0-9, bit 0x40 not set → +0
  - 'A'-'F' (0x41-0x46): `& 0x0F` gives 1-6, bit 0x40 set → +9, yielding 10-15
- NEON: `vsliq_n_u8` (SLI, Shift Left and Insert) combines the two nibbles in one instruction
- NEON: `vuzp1q_u8` / `vuzp2q_u8` for deinterleaving
- x86_64: avoids the signed `_mm_cmplt_epi8` compare (which misclassified chars below '0'), using the same bit-0x40 test instead

Benchmarks (64 hex chars → 32 bytes):
- NEON (ARM64): ~2.5 ns / 8 cycles
- SSE2 (x86_64): ~5 ns (estimated)
- Scalar LUT fallback: ~19 ns
- Throughput: 12.7 GB/s on Apple Silicon
3. Alpha Transparency Check
Implementation:
- Combined alpha byte checks: ANDs 8 SIMD registers before branching
- Scalar fallback: u64 bitmask over two pixels at a time (little-endian), byte-by-byte on big-endian
Benchmarks (27 MP image, 109 MB RGBA): 5.37 ms scalar → 0.59 ms SIMD (9.1x).
Theoretical minimum at 200 GB/s memory bandwidth: 0.55 ms.
4. Set Alpha Opaque
Implementation:
Benchmarks (27 MP image): 3.08 ms scalar → 0.67 ms SIMD (4.6x).
5. RGB → RGBA Conversion
Implementation:
- NEON: `vld3q_u8` to load RGB data deinterleaved into R/G/B planes, then `vst4q_u8` to store as RGBA with alpha = 255 (see the sketch below)
- SSSE3: `_mm_shuffle_epi8` (pshufb) to rearrange 12 RGB bytes → 16 RGBA bytes per iteration

Performance: ~4x speedup on large images compared to naive scalar.
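A minimal sketch of the NEON load/store pair for one 16-pixel block. The fixed-size arrays are only there to make the bounds explicit; the real code loops over the image with remainder handling:

```rust
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn rgb_to_rgba_neon_16px(rgb: &[u8; 48], out: &mut [u8; 64]) {
    use std::arch::aarch64::*;

    // vld3q_u8 deinterleaves 48 RGB bytes into separate R, G and B planes.
    let planes: uint8x16x3_t = vld3q_u8(rgb.as_ptr());
    // Re-pack as 4 planes with a constant alpha of 255, and let vst4q_u8
    // interleave them back into 64 RGBA bytes.
    let rgba = uint8x16x4_t(planes.0, planes.1, planes.2, vdupq_n_u8(255));
    vst4q_u8(out.as_mut_ptr(), rgba);
}
```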
Cross-Platform Compatibility
- Runtime feature detection with `is_x86_feature_detected!` on x86_64 (see the sketch below)
- `#[target_feature]` annotations: all SIMD functions properly annotated
- `#[cfg(target_endian = "little"/"big")]` for conditional compilation of the scalar fallbacks
- `checked_mul()` to prevent overflow on large inputs
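A sketch of the runtime-dispatch shape on x86_64. The function names are placeholders for the real SIMD/scalar pairs, and the SIMD bodies are stubbed out here:

```rust
/// Dispatch to the best available implementation at runtime.
pub fn set_alpha_opaque(rgba: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: only reached when the CPU reports AVX2 support.
            return unsafe { set_alpha_avx2(rgba) };
        }
        if is_x86_feature_detected!("sse2") {
            // SAFETY: only reached when the CPU reports SSE2 support.
            return unsafe { set_alpha_sse2(rgba) };
        }
    }
    set_alpha_scalar(rgba);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn set_alpha_avx2(rgba: &mut [u8]) { set_alpha_scalar(rgba) } // stub

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn set_alpha_sse2(rgba: &mut [u8]) { set_alpha_scalar(rgba) } // stub

fn set_alpha_scalar(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        px[3] = 0xFF;
    }
}
```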
Stability Fixes

- Fixed out-of-bounds SIMD reads in `rgb_to_rgba_ssse3` (16-byte loads require proper bounds: `i + 52` for the 16-pixel loop, `i + 16` for the 4-pixel loop)
- Added overflow protection to `nearest_neighbor_downsample`
- Added `#[target_feature(enable = "sse2")]` to SSE2 functions for proper code generation

Testing
Files Changed
- `src/simd/hex.rs` - Hex encoding/decoding (947 lines)
- `src/simd/image.rs` - Image operations (937 lines)