123 changes: 123 additions & 0 deletions README.md
@@ -337,3 +337,126 @@ Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common
```

These steps will initialize your environment and allow you to use the correct Visual Studio tools.

---

## POWER8 / PowerPC Support

bitnet.cpp has been ported to IBM POWER8 (ppc64le) with AltiVec/VSX SIMD optimizations.
This is the first port of BitNet to the PowerPC architecture.

### POWER8 Build

```bash
cd BitNet
mkdir build-ppc && cd build-ppc
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -mtune=power8 -funroll-loops" \
-DCMAKE_CXX_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -mtune=power8 -funroll-loops -std=c++17"
make -j$(nproc)
```

### POWER8 Optimizations

Three levels of optimization are implemented:

1. **Scalar fallback** — Baseline C code for any PowerPC target
2. **VSX vec_msum kernels** — Uses `vmsummbm`, which performs 16 signed×unsigned byte multiply-accumulates in a single instruction. All 5 I2_S kernel functions are vectorized: `quantize_i2_s`, `1x1`, `1x4_32W`, `1xN`, `Nx1` (see the sketch after this list)
3. **L3 resident dcbt prefetch** — Uses `dcbt` with the TH=0x10 hint to keep weight tensors pinned in L3 cache between token-generation steps, avoiding DRAM re-fetches
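
The core of item 2, sketched as a hedged illustration: the helper name and loop structure are ours, the repo's real kernels also unpack the 2-bit weights, and the sketch assumes 16-byte-aligned buffers with a length that is a multiple of 16, since `vec_ld` performs aligned loads.

```c
#include <altivec.h>
#include <stdint.h>

// Illustrative I2_S inner loop: each vec_msum (vmsummbm) multiplies
// 16 signed activation bytes by 16 unsigned weight bytes and
// accumulates the products into 4 int32 lanes.
static int32_t dot_i2s_altivec(const signed char *act,
                               const unsigned char *wt, int n) {
    vector signed int acc = vec_splat_s32(0);
    for (int i = 0; i < n; i += 16) {
        vector signed char   va = vec_ld(i, act);  // 16 activation bytes
        vector unsigned char vb = vec_ld(i, wt);   // 16 decoded weight bytes
        acc = vec_msum(va, vb, acc);               // 16 MACs per instruction
    }
    // Endian-safe horizontal reduction of the 4 accumulator lanes.
    int32_t lanes[4] __attribute__((aligned(16)));
    vec_st(acc, 0, lanes);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```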

### POWER8 Benchmarks

- **Hardware**: IBM Power System S824 (8286-42A), dual 8-core POWER8 (16c/128t, SMT8), 512 GB DDR3, Ubuntu 20.04 LTS
- **Run config**: 64 threads, `numactl --interleave=all`, `OMP_PROC_BIND=spread`
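
A typical invocation under this configuration might look like the following; the binary and model paths are assumptions for illustration:

```bash
OMP_PROC_BIND=spread numactl --interleave=all \
    ./build-ppc/bin/llama-cli -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -t 64 -p "Once upon a time" -n 32
```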

#### Scalar → VSX Speedup

| Model | Size | pp128 (scalar) | pp128 (VSX) | Speedup |
|-------|------|----------------|-------------|---------|
| BitNet 700M | 257 MiB | 21.48 t/s | 211.48 t/s | **9.8x** |
| BitNet 2B | 1.71 GiB | 8.04 t/s | 73.03 t/s | **9.1x** |
| Llama3-8B BitNet | 3.58 GiB | 2.60 t/s | 27.39 t/s | **10.5x** |

#### Full Results (VSX + dcbt resident prefetch)

| Model | Size | Params | pp128 | pp256 | pp512 | tg32 |
|-------|------|--------|-------|-------|-------|------|
| BitNet 700M | 257 MiB | 728.84 M | 209.38 t/s | 176.67 t/s | 134.10 t/s | 24.02 t/s |
| BitNet 2B | 1.71 GiB | 2.74 B | 71.95 t/s | 64.98 t/s | 52.67 t/s | 11.99 t/s |
| Llama3-8B BitNet | 3.58 GiB | 8.03 B | 26.98 t/s | 25.06 t/s | 21.70 t/s | 5.63 t/s |

#### Total Speedup vs Scalar Baseline

| Model | pp128 | tg32 |
|-------|-------|------|
| 700M | **9.7x** | **2.2x** |
| 2B | **9.0x** | **2.9x** |
| 8B | **10.4x** | **3.5x** |

### Key Technical Details

- **vec_msum (vmsummbm)**: One POWER8 instruction multiplies 16 signed×unsigned byte pairs and accumulates to 4 int32 lanes — ideal for I2_S ternary {-1, 0, 1} dot products
- **dcbt resident (TH=0x10)**: Tells the POWER8 cache controller to keep data sticky in L3 instead of letting normal LRU eviction age it out — gives +5-15% on token generation (see the sketch after this list)
- **Optimal threads**: 64 (not 128) — SMT8 causes cache thrashing at full thread count
- **NUMA**: `--interleave=all` required for models spanning both memory nodes
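
A hedged sketch of the dcbt hint from the second bullet, using inline asm; the helper names are hypothetical, the server-form operand order `dcbt RA,RB,TH` is assumed (embedded dialects put TH first), and 128-byte POWER8 cache lines are assumed:

```c
#include <stddef.h>

// Hint that a block should stay resident in cache (TH = 0x10 = 16).
static inline void prefetch_resident(const void *p) {
    __asm__ volatile ("dcbt 0, %0, 16" : : "r"(p));
}

// Walk a weight tensor one cache line at a time so it stays warm in L3
// between token-generation steps.
static void keep_weights_resident(const void *base, size_t bytes) {
    const char *pc = (const char *)base;
    for (size_t off = 0; off < bytes; off += 128)
        prefetch_resident(pc + off);
}
```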

### POWER8 Models

Tested with:
- [microsoft/BitNet-b1.58-2B-4T](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T) (I2_S quantized)
- [1bitLLM/bitnet_b1_58-large](https://huggingface.co/1bitLLM/bitnet_b1_58-large) (700M)
- [HF1BitLLM/Llama3-8B-1.58-100B-tokens](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens) (converted via `convert-hf-to-gguf-bitnet.py --outtype f32` then `llama-quantize` to I2_S)
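
The Llama3-8B entry above mentions a two-step conversion; a hedged sketch of those commands, with the script location, directory layout, and output names as placeholders:

```bash
python3 utils/convert-hf-to-gguf-bitnet.py models/Llama3-8B-1.58-100B-tokens --outtype f32
./build-ppc/bin/llama-quantize \
    models/Llama3-8B-1.58-100B-tokens/ggml-model-f32.gguf \
    models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf I2_S
```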

### Power Mac G5 (Big-Endian) Support

bitnet.cpp also runs on Power Mac G5 (PowerPC 970, big-endian) with Mac OS X 10.5 Leopard.
This required solving the GGUF big-endian byte-swap problem: GGUF is always little-endian on disk,
so all multi-byte scalar values and tensor data must be byte-swapped when reading on big-endian hosts.
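
The shape of the fix, sketched below; the signature is illustrative, and the patch's real `gguf_fread_val()` covers every GGUF scalar read (header, KV pairs, tensor info):

```c
#include <stdio.h>

// Read a little-endian GGUF scalar and byte-swap it in place when the
// host is big-endian. Strings and raw tensor bytes skip the swap.
static size_t gguf_fread_val(void *dst, size_t size, FILE *f) {
    size_t n = fread(dst, 1, size, f);
#if defined(__BIG_ENDIAN__)
    if (n == size && size > 1) {
        unsigned char *b = (unsigned char *)dst;
        for (size_t i = 0, j = size - 1; i < j; ++i, --j) {
            unsigned char t = b[i]; b[i] = b[j]; b[j] = t;
        }
    }
#endif
    return n;
}
```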

#### G5 Big-Endian Patches

The `patches/` directory contains everything needed:

- **`g5-big-endian.patch`** — Adds `gguf_fread_val()` byte-swap function and patches all GGUF scalar reads (header, KV pairs, tensor info). Also adds tensor data byte-swap for F32, F16, and I2_S scale at load time. Fixes `sizeof(bool)==4` on PowerPC GCC.
- **`regex-ppc.h`** — POSIX regex wrapper replacing `std::regex`, which crashes with a Bus error on PPC big-endian (GCC libstdc++ bug); see the sketch after this list.
- **`build_g5.sh`** — Build script that applies patches and compiles with G5-safe flags.
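
The core idea behind `regex-ppc.h`, sketched with plain POSIX calls; the real header maps more of the `std::regex` surface it replaces:

```c
#include <regex.h>
#include <stdbool.h>

// Match text against an extended POSIX regex, avoiding std::regex,
// which Bus-errors on PPC big-endian.
static bool ppc_regex_match(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return false;
    bool ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```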

#### G5 Build

```bash
cd BitNet
./patches/build_g5.sh /usr/local/gcc-10/bin
```

Or manually:
```bash
cd 3rdparty/llama.cpp
git apply ../../patches/g5-big-endian.patch
cp ../../patches/regex-ppc.h common/
make -j2 CC=/usr/local/gcc-10/bin/gcc CXX=/usr/local/gcc-10/bin/g++ \
GGML_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 "GGML_NO_OPENMP=" \
MK_CFLAGS="-mcpu=970 -maltivec -Os -fno-strict-aliasing -I ggml/include" \
MK_CXXFLAGS="-mcpu=970 -maltivec -Os -fno-strict-aliasing -std=gnu++17 -I ggml/include -include common/regex-ppc.h" \
MK_LDFLAGS="-L/usr/local/gcc-10/lib -lgomp" \
llama-cli
```

#### G5 Benchmarks

- **Hardware**: Power Mac G5 Dual 2.0 GHz (PowerPC 970), 8 GB DDR2, Mac OS X 10.5.8 Leopard
- **Compiler**: GCC 10.5.0, `-Os -mcpu=970 -maltivec`

| Model | Size | pp5 | tg30 | Notes |
|-------|------|-----|------|-------|
| BitNet 700M | 257 MiB | 4.31 t/s | 1.61 t/s | Scalar I2_S, 2 threads |

#### G5 Key Details

- **Optimization level**: `-Os` is the highest safe level. `-O2` and `-O3` cause Bus errors from instruction scheduling on PowerPC 970.
- **GGUF byte-swap**: All GGUF numeric fields read through `gguf_fread_val()` which byte-swaps on `__BIG_ENDIAN__`. String data and raw tensor bytes use `gguf_fread_el()` (no swap).
- **I2_S tensor layout**: Quantized uint8 bytes are endian-independent. Only the trailing float scale (at offset `ne0*ne1/4`) needs a byte-swap (sketched after this list).
- **`sizeof(bool)`**: PowerPC GCC defines `sizeof(bool)==4` but GGUF stores bools as 1 byte. Fixed with compile-time conditional.
- **`--no-mmap` required**: Mac OS X 10.5 mmap behavior differs; use the `--no-mmap` flag.
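
The I2_S fix-up from the third bullet, sketched; the function name is assumed, and the offset follows from the 2-bit packing (four weights per byte):

```c
#include <stdint.h>

// On big-endian hosts only the trailing f32 scale of an I2_S tensor
// needs swapping; the packed 2-bit weight bytes are endian-neutral.
static void i2s_swap_scale_be(uint8_t *data, int64_t ne0, int64_t ne1) {
#if defined(__BIG_ENDIAN__)
    uint8_t *s = data + (ne0 * ne1) / 4;  // scale follows the packed weights
    uint8_t t;
    t = s[0]; s[0] = s[3]; s[3] = t;
    t = s[1]; s[1] = s[2]; s[2] = t;
#endif
}
```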

Developed by [Elyan Labs](https://github.com/Scottcjn).
9 changes: 9 additions & 0 deletions include/bitnet-lut-kernels.h
@@ -0,0 +1,9 @@
// Stub LUT kernels header for POWER8 port
// BitNet LUT kernels are x86/ARM specific - POWER8 uses I2_S (MAD) path
// TODO: Implement vec_perm based LUT kernels for POWER8 VSX

#pragma once

// Empty stubs - LUT path not used on PowerPC
// The I2_S (multiply-accumulate-decompose) path is used instead

11 changes: 11 additions & 0 deletions include/gemm-config.h
@@ -31,5 +31,16 @@
#define PARALLEL_SIZE 4
#endif // ACT_PARALLEL
#endif // __ARM_FEATURE_DOTPROD
#elif defined(__VSX__) || defined(__ALTIVEC__) || defined(__powerpc64__) || defined(__powerpc__) || defined(__ppc__)
// PowerPC (G5 AltiVec / POWER8 VSX)
#if defined(ACT_PARALLEL)
#define ROW_BLOCK_SIZE 4
#define COL_BLOCK_SIZE 128
#define PARALLEL_SIZE 4
#else
#define ROW_BLOCK_SIZE 128
#define COL_BLOCK_SIZE 32
#define PARALLEL_SIZE 8
#endif // ACT_PARALLEL
#endif // __AVX__

87 changes: 87 additions & 0 deletions patches/build_g5.sh
@@ -0,0 +1,87 @@
#!/bin/bash
# build_g5.sh - Build BitNet for Power Mac G5 (big-endian PowerPC AltiVec)
#
# Requirements:
# - Mac OS X 10.5 Leopard (or Linux ppc64be)
# - GCC 10+ with C++17 support
# - Model file: bitnet_b1_58-large converted to GGUF I2_S format
#
# The AltiVec SIMD kernels use the same code path as POWER8 VSX,
# abstracted through compatibility macros in ggml-bitnet-mad.cpp.
# Key operations: vec_msum (vmsummbm), vec_ld, vec_splat_u8.
#
# Usage:
# ./patches/build_g5.sh [GCC_PREFIX]
#
# Example:
# ./patches/build_g5.sh /usr/local/gcc-10/bin
# ./patches/build_g5.sh # uses gcc/g++ from PATH

set -e

GCC_PREFIX="${1:-}"
if [ -n "$GCC_PREFIX" ]; then
CC="${GCC_PREFIX}/gcc"
CXX="${GCC_PREFIX}/g++"
else
CC="gcc"
CXX="g++"
fi

echo "=== BitNet G5 AltiVec Build ==="
echo "CC: $CC"
echo "CXX: $CXX"
echo ""

# Step 1: Apply big-endian patches to llama.cpp submodule
echo ">>> Step 1: Applying big-endian patches..."
cd 3rdparty/llama.cpp
if git diff --quiet HEAD 2>/dev/null; then
git apply ../../patches/g5-big-endian.patch
echo " Applied g5-big-endian.patch"
else
echo " Submodule already has local changes, skipping patch"
fi

# Step 2: Copy regex compatibility header
echo ">>> Step 2: Installing regex-ppc.h..."
cp ../../patches/regex-ppc.h common/regex-ppc.h
echo " Installed common/regex-ppc.h"

# Step 3: Build using Makefile with G5 AltiVec flags
# -Os is required: -O2 and -O3 cause Bus errors on G5 due to Mach-O ABI
# stack alignment issues when GCC generates aggressive vector register spills.
# -include common/regex-ppc.h replaces broken std::regex on PPC BE
echo ">>> Step 3: Building llama-cli with AltiVec flags..."
echo " (This takes several minutes on dual G5)"
echo " NOTE: Use -t 1 for inference (single thread is faster due to"
echo " barrier overhead on 870 graph nodes per token)"

make -j2 \
CC="$CC" \
CXX="$CXX" \
GGML_NO_METAL=1 \
LLAMA_NO_ACCELERATE=1 \
LLAMA_NO_LLAMAFILE=1 \
"GGML_NO_OPENMP=" \
MK_CFLAGS="-mcpu=970 -maltivec -Os -I ggml/include" \
MK_CXXFLAGS="-mcpu=970 -maltivec -Os -std=gnu++17 -I ggml/include -include common/regex-ppc.h" \
MK_LDFLAGS="-L$(dirname "$CC")/../lib -lgomp" \
llama-cli

echo ""
echo "=== Build complete ==="
echo ""
echo "Run inference with:"
echo " ./3rdparty/llama.cpp/llama-cli \\"
echo " -m <model>.gguf \\"
echo " -p \"Once upon a time\" \\"
echo " -n 30 -t 1 --no-warmup --no-mmap"
echo ""
echo "Performance: pp6 ~4.7 t/s, tg ~1.7 t/s (AltiVec, -Os, -t 1)"
echo ""
echo "NOTE: AltiVec dot product kernels are 16x faster than scalar"
echo "(verified by microbenchmark), but end-to-end speedup is limited"
echo "by Amdahl's law: matmul is only 12-24% of total inference time."
echo "The remaining time is framework overhead (layernorm, softmax,"
echo "RoPE, activation quantization, 870 barrier syncs per token)."