README
PascalVal is a Linux-based diagnostic tool that validates NVIDIA GPU health and system integration. Instead of chasing peak benchmark scores, it targets the three subsystems most responsible for slowdowns and instability:
- PCIe Transport — link width, generation, effective bandwidth
- Memory Paths — explicit transfers and Unified Memory behavior
- Compute Health — SGEMM throughput, stability, and drift
Goal: confidently answer one critical question:
Is this GPU—and its integration into this system—operating as expected?
Each execution follows a gated validation pipeline:
PascalVal performs a bounded PCIe transport validation:
- Verifies negotiated link generation and width
- Measures host↔device bandwidth
- Detects lane drops, fallback modes, and platform-level transport issues
For deep PCIe diagnostics and bandwidth characterization (standalone toolchain), see:
https://github.com/parallelArchitect/gpu-pcie-diagnostic
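As an illustration of the checks listed above, here is a minimal sketch of how a negotiated link could be compared against what the device supports, and how measured bandwidth could be compared against the link's theoretical ceiling. It is not PascalVal's actual implementation. The function names and the 0.7 efficiency threshold are assumptions made for the example. The per-lane figures follow from the PCIe signaling rates and line encodings (8b/10b for Gen1 and Gen2, 128b/130b for Gen3 and later) and ignore packet-level protocol overhead.

```python
# Approximate one-direction bandwidth per lane in GB/s after line encoding.
# Gen1/Gen2 use 8b/10b; Gen3+ use 128b/130b. Protocol overhead is ignored.
PER_LANE_GBPS = {
    1: 2.5 * 8 / 10 / 8,
    2: 5.0 * 8 / 10 / 8,
    3: 8.0 * 128 / 130 / 8,
    4: 16.0 * 128 / 130 / 8,
    5: 32.0 * 128 / 130 / 8,
}


def theoretical_gbps(gen: int, width: int) -> float:
    """Theoretical one-direction ceiling for a link of this generation and width."""
    return PER_LANE_GBPS[gen] * width


def check_link(cur_gen, max_gen, cur_width, max_width,
               measured_gbps, min_efficiency=0.7):
    """Return a list of findings; an empty list means nothing suspicious.

    min_efficiency is a hypothetical threshold chosen for this sketch.
    """
    findings = []
    if cur_width < max_width:
        findings.append(f"lane drop: x{cur_width} negotiated, x{max_width} supported")
    if cur_gen < max_gen:
        findings.append(f"generation fallback: Gen{cur_gen} negotiated, Gen{max_gen} supported")
    ceiling = theoretical_gbps(cur_gen, cur_width)
    if measured_gbps < min_efficiency * ceiling:
        findings.append(
            f"low bandwidth: {measured_gbps:.2f} GB/s vs {ceiling:.2f} GB/s ceiling"
        )
    return findings
```

Many GPUs lower the link generation at idle to save power, so a check like this is only meaningful when the values are sampled while a transfer is in flight.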
PascalVal performs a bounded Unified Memory validation focused on device residency behavior.
- Measures device-initiated Unified Memory prefetch throughput
- Explicitly does not run naive bulk bandwidth tests
- Avoids misleading results caused by pageable copies or host-side effects
The UM module answers a narrow validation question:
Is Unified Memory resident on the device and servicing accesses through the expected fast path?
For deep Unified Memory analysis (migration paths, fault behavior, oversubscription), see:
https://github.com/parallelArchitect/pascal-um-benchmark
- Executes a fixed-size SGEMM workload representative of scientific and ML compute
- Reports achieved GFLOPS relative to theoretical capability
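To make "achieved GFLOPS relative to theoretical capability" concrete, the arithmetic can be sketched as follows. An SGEMM of shape M x N x K performs 2·M·N·K floating-point operations (one multiply and one add per inner-product term), and peak FP32 throughput is commonly estimated as 2 x CUDA cores x clock, counting one fused multiply-add per core per cycle. The function names are made up for this example, and how PascalVal derives its theoretical figure is not stated in this README.

```python
def sgemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved GFLOPS for one M x N x K SGEMM that took `seconds` to run."""
    return 2.0 * m * n * k / seconds / 1e9


def peak_fp32_gflops(cuda_cores: int, clock_mhz: float) -> float:
    """Estimated FP32 peak: one FMA (2 FLOPs) per core per cycle."""
    return 2.0 * cuda_cores * clock_mhz / 1e3


def efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Fraction of the estimated peak that was actually achieved."""
    return achieved_gflops / peak_gflops


# Example with the published GTX 1080 figures (2560 cores, 1733 MHz boost).
# Real clocks vary with thermals and power limits, so treat the peak as a guide.
peak = peak_fp32_gflops(2560, 1733)
achieved = sgemm_gflops(4096, 4096, 4096, seconds=0.02)
print(f"{achieved:.0f} of {peak:.0f} GFLOPS ({efficiency(achieved, peak):.0%})")
```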
If a gate fails, downstream stages do not execute.
This prevents misleading compute results when the system is already unhealthy.
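The gating behavior can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the stage names, the "pass" / "fail" / "skipped" labels, and the PCIe, then Unified Memory, then compute ordering are inferred from the descriptions above.

```python
from typing import Callable, Dict, List, Tuple

# A gate is a (name, check) pair; the check returns True when the stage passes.
Gate = Tuple[str, Callable[[], bool]]


def run_gated(gates: List[Gate]) -> Dict[str, str]:
    """Run gates in order and stop executing checks after the first failure."""
    results: Dict[str, str] = {}
    blocked = False
    for name, check in gates:
        if blocked:
            # An upstream gate failed, so this stage is recorded but never run.
            results[name] = "skipped"
            continue
        passed = check()
        results[name] = "pass" if passed else "fail"
        blocked = not passed
    return results


# Hypothetical run where the Unified Memory gate fails.
outcome = run_gated([
    ("pcie", lambda: True),
    ("unified_memory", lambda: False),
    ("compute", lambda: True),
])
print(outcome)
```

Recording skipped stages explicitly, rather than omitting them, keeps the artifact unambiguous about why no compute result exists.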
Each execution produces a single evidence artifact:
results/json/<run_id>/run.json
This file is the source of truth for all analysis.
Yes.
PascalVal runs on all NVIDIA GPUs supported by the Linux NVIDIA driver.
On every NVIDIA GPU, PascalVal provides:
- PCIe validation
- Unified Memory validation (where supported)
- Compute validation using cuBLAS
- Historical evaluation via the analyzer
This is the portable baseline path.
On Pascal GP104 (SM 6.1) GPUs (for example GTX 1080 / 1070 / 1060-6GB):
- The validator automatically selects a custom SGEMM kernel
- The kernel uses a fixed, Pascal-tuned execution configuration
- No user tuning or manual selection is required
On all other NVIDIA GPUs, PascalVal automatically uses cuBLAS as the compute validation path.
This routing decision is policy-driven and recorded in each run.json artifact.
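A routing policy of this kind might look like the sketch below. The function name and the returned field names are hypothetical and do not reflect the real run.json schema. Note that compute capability 6.1 is shared by other Pascal chips besides GP104, so an actual policy would probably need an additional check (for example on the device name or chip ID) that this sketch leaves out.

```python
def select_compute_path(sm_major: int, sm_minor: int) -> dict:
    """Pick the compute validation engine from the device's compute capability.

    Simplified assumption: SM 6.1 alone selects the custom kernel.
    """
    if (sm_major, sm_minor) == (6, 1):
        return {"engine": "custom_sgemm", "reason": "sm_61_pascal"}
    return {"engine": "cublas", "reason": "portable_baseline"}


# The returned record could be stored with the run so the decision is auditable.
print(select_compute_path(6, 1))
print(select_compute_path(8, 6))
```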
The Pascal GP104 custom SGEMM binary (sgemm_gp104) is built as a hardened Linux executable:
- PIE (ASLR)
- Non-executable stack (NX)
- Immediate binding (RELRO / BIND_NOW)
The binary is non-networked, dynamically linked, and performs compute only.
It does not modify system state.
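One of these properties can be spot-checked without extra tooling. The sketch below reads the ELF header and reports whether the file type is ET_DYN, which is what a PIE executable uses. This is a necessary condition rather than proof, since shared objects are also ET_DYN. Verifying NX and RELRO / BIND_NOW requires inspecting the program headers and dynamic section (for example with `readelf -lW` and `readelf -d`), which this sketch does not attempt. The helper names are made up for this example.

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # position-independent: PIE executables and shared objects


def elf_type(header: bytes) -> int:
    """Return e_type from the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # EI_DATA (byte 5) gives the byte order: 1 = little-endian, 2 = big-endian.
    fmt = "<H" if header[5] == 1 else ">H"
    # e_type is the 16-bit field immediately after the 16-byte e_ident array.
    return struct.unpack_from(fmt, header, 16)[0]


def looks_like_pie(path: str) -> bool:
    """True when the file at `path` is an ELF object of type ET_DYN."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN
```

A call such as `looks_like_pie("sgemm_gp104")` would need the real path to the built binary, which depends on the build layout.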
Repeated validation under controlled conditions shows expected differences:
- General-purpose, adaptive library
- Runtime kernel selection and scheduling
- Sensitive to boost state, power management, and background activity
Role: Portable reference and sanity check.
- Fixed kernel configuration and launch geometry
- Minimal runtime decision-making
- Lower variance and higher determinism under identical conditions
Role: Preferred signal for long-term health evaluation on supported Pascal GPUs.
cuBLAS answers: “Is the system broadly functional?”
Custom SGEMM answers: “Is this Pascal GPU operating within its expected deterministic compute envelope?”
Both are valid.
They are not equivalent signals.
PascalVal is a Linux-based tool for NVIDIA GPUs.
Requirements
- Linux
- NVIDIA GPU with a working NVIDIA driver (NVML available)
- Python 3.9+
git clone https://github.com/parallelArchitect/pascal-gpu-val.git
cd pascal-gpu-val
pip install -r requirements.txt
python3 run_validation.py
This executes the gated pipeline and writes one immutable evidence artifact to:
results/json/<run_id>/run.json
python3 run_validation.py --all-gpus
Runs one validation per GPU and produces one artifact per GPU.
python3 tools/analyze_runs.py --engine all --out results/analysis/analyzer_report.json
The analyzer:
- Reads existing run.json artifacts (read-only)
- Builds robust baselines (median + MAD)
- Reports whether deviations are isolated or persistent over time
PascalVal is intended to be executed repeatedly over time as part of normal system operation.
- Each execution appends one immutable artifact under results/json/
- Artifacts are append-only
- No single run is treated as authoritative
Behavior is evaluated across multiple runs using the analyzer.
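For readers wondering what "append-only" and "immutable" can mean at the filesystem level, here is one way to write an artifact so that an existing run is never overwritten. This is a sketch under assumptions. The function name, the run_id format, and the record contents are hypothetical, and the project may enforce immutability differently.

```python
import json
from pathlib import Path


def write_artifact(root: Path, run_id: str, record: dict) -> Path:
    """Write <root>/<run_id>/run.json exactly once and mark it read-only."""
    run_dir = root / run_id
    # exist_ok=False raises FileExistsError if this run_id was already used.
    run_dir.mkdir(parents=True, exist_ok=False)
    path = run_dir / "run.json"
    # Mode "x" is exclusive creation: it fails rather than truncating a file.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    path.chmod(0o444)  # discourage later edits
    return path
```

Calling it twice with the same run_id raises FileExistsError instead of replacing the earlier evidence.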
Meaningful historical analysis requires multiple runs.
- Fewer than 12 valid runs → results are reported as insufficient data
- 12+ valid runs → baseline statistics (median, MAD) become stable
- Larger histories improve confidence and drift sensitivity
The analyzer explicitly reports when available evidence is insufficient to support conclusions.
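The minimum-history rule can be expressed as a small guard in front of the baseline computation. The sketch below assumes each valid run contributes one numeric metric (for example achieved GFLOPS). The function name and the returned field names are illustrative, not the analyzer's real report schema.

```python
from statistics import median

MIN_VALID_RUNS = 12  # threshold stated above


def baseline(values):
    """Return median/MAD for a metric, or an explicit insufficient-data result."""
    if len(values) < MIN_VALID_RUNS:
        return {"status": "insufficient_data", "valid_runs": len(values)}
    med = median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = median([abs(v - med) for v in values])
    return {"status": "ok", "valid_runs": len(values), "median": med, "mad": mad}


print(baseline([100.0] * 5))
print(baseline([100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 100]))
```

Returning a status instead of raising keeps "not enough evidence" as a first-class outcome in the report.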
The analyzer operates read-only on existing run.json artifacts.
It evaluates recent history to determine whether observed deviations are:
- Isolated (expected variance), or
- Persistent over time (behavioral drift)
Baselines are derived using robust statistics (median and MAD).
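The isolated-versus-persistent distinction could be sketched as follows. Each recent value is scored with a robust z-score, |x - median| / (1.4826 x MAD), where 1.4826 is the usual constant that makes MAD comparable to a standard deviation for normally distributed data. The 3.5 cutoff, the "last three runs" persistence rule, and the label names are assumptions for this example. The analyzer's actual thresholds are not documented in this README.

```python
from statistics import median


def robust_z(value, med, mad):
    """Distance from the median in MAD-scaled units."""
    scale = 1.4826 * mad
    if scale == 0:
        return 0.0 if value == med else float("inf")
    return abs(value - med) / scale


def classify(history, recent, z_limit=3.5, persist_min=3):
    """Label recent runs as 'nominal', 'isolated', or 'persistent'.

    history: baseline values; recent: the latest values, oldest first.
    """
    med = median(history)
    mad = median([abs(v - med) for v in history])
    flags = [robust_z(v, med, mad) > z_limit for v in recent]
    if not any(flags):
        return "nominal"
    # Persistent only when the most recent persist_min runs all deviate.
    if len(flags) >= persist_min and all(flags[-persist_min:]):
        return "persistent"
    return "isolated"
```

Median and MAD are used instead of mean and standard deviation because a single bad run barely moves them, so one outlier does not distort the baseline it is judged against.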
Derived reports are written to:
results/analysis/analyzer_report.json
results/
json/<run_id>/run.json # primary evidence artifact (append-only)
analysis/analyzer_report.json # derived report (read-only)
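Given this layout, collecting every artifact for offline analysis takes only a glob. The sketch below assumes nothing about the run.json schema beyond it being valid JSON. The function name is hypothetical, and sorting by run_id yields chronological order only if run IDs happen to sort that way.

```python
import json
from pathlib import Path


def load_runs(root="results/json"):
    """Return (run_id, parsed_json) pairs for every run.json under root."""
    runs = []
    for path in sorted(Path(root).glob("*/run.json")):
        # Files are opened read-only; nothing under results/json/ is modified.
        with open(path, encoding="utf-8") as f:
            runs.append((path.parent.name, json.load(f)))
    return runs
```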
See the full system diagram and file-level boundaries in
docs/architecture/ARCHITECTURE.md.
PascalVal is intentionally split into two roles:
- PCIe
- Unified Memory
- Compute
- Gate ordering
- Evaluation
- Verdict generation
- Artifact writing
PascalVal is:
- A Linux NVIDIA GPU health validator
- A drift-evaluation tool
- A pre-compute sanity check
PascalVal is not:
- A peak benchmark
- A tuning framework
- An application profiler
The tools/ directory contains optional, offline utilities that analyze historical
run.json artifacts produced by PascalVal. These tools operate in read-only mode
and never influence validation behavior or results.
Typical uses include:
- Aggregating multiple validation runs
- Evaluating variance and persistent drift over time
- Producing summary reports for long-term GPU health monitoring
MIT License
Author: Joe McLaren — Human–AI collaborative engineering
For professional inquiries related to GPU validation, system diagnostics, or performance analysis, contact: