ENH: use limited C API to produce abi3 wheels for with-GIL interpreters in release builds #828

rgommers · 2025-10-19T20:12:44Z

The default for a local build is still to produce cp3xx wheels; when passing a build flag it is now possible to opt into producing wheels for the Stable ABI. This reduces the number of wheels to build from 5 to 3 per platform.
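
To make the "5 to 3" arithmetic concrete, here is a minimal sketch. The interpreter list is an assumption chosen for illustration (three with-GIL and two free-threaded CPython versions); the actual set of targeted versions is not spelled out above. The rule it encodes is the one from the PR title: with-GIL interpreters can share one `abi3` wheel, while each free-threaded interpreter still needs its own version-specific wheel.

```python
# Hypothetical interpreter matrix, chosen only so the numbers match "5 to 3";
# the real set of targeted versions is an assumption here.
INTERPRETERS = ["cp312", "cp313", "cp314", "cp313t", "cp314t"]


def is_free_threaded(tag: str) -> bool:
    """Free-threaded CPython builds carry a trailing 't' in their ABI tag."""
    return tag.endswith("t")


def wheel_tags(interpreters: list[str], use_abi3: bool) -> list[str]:
    """Return the '<python tag>-<abi tag>' pairs needing a wheel per platform."""
    if not use_abi3:
        # One version-specific wheel per interpreter, e.g. cp312-cp312.
        return [f"{tag.rstrip('t')}-{tag}" for tag in interpreters]

    with_gil = [t for t in interpreters if not is_free_threaded(t)]
    free_threaded = [t for t in interpreters if is_free_threaded(t)]
    tags = []
    if with_gil:
        # A single abi3 wheel built against the oldest with-GIL version
        # serves that version and all newer with-GIL ones.
        oldest = min(with_gil, key=lambda t: int(t[2:]))
        tags.append(f"{oldest}-abi3")
    # Free-threaded builds cannot load abi3 wheels, so they stay as they are.
    tags += [f"{t.rstrip('t')}-{t}" for t in free_threaded]
    return tags


print(len(wheel_tags(INTERPRETERS, use_abi3=False)), "wheels without abi3")
print(len(wheel_tags(INTERPRETERS, use_abi3=True)), "wheels with abi3")
```

The saving grows with every additional with-GIL Python version that is supported, since those all collapse into the single `abi3` entry.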

TODO: benchmarking to ensure we don't lose a significant amount of performance.

rgommers · 2026-01-19T21:34:07Z

After fixing the benchmark suite in gh-835, I got around to running it. For very small image sizes there is quite a lot of overhead (up to about 1.5x slower for 16x16). For operations that take about 1 ms or more, the difference is very small. Example:

$ pixi r bench --compare -t Dwt2TimeSuite
✨ Pixi task (bench in default): spin bench --compare -t Dwt2TimeSuite
$ cd benchmarks
$ asv continuous --interleave-rounds --factor 1.05 --bench Dwt2TimeSuite b8ddd4d843ffcd82f8569e9988115d3a11d11120 bb4e0e3d59fb165266f68f277c443f73cdf944d5
· `wheel_cache_size` has been renamed to `build_cache_size`. Update your `asv.conf.json` accordingly.
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.14-Cython-meson-python-numpy
·· Installing bb4e0e3d <bench-limited-api> into virtualenv-py3.14-Cython-meson-python-numpy..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pywt commit b8ddd4d8 <main> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[ 0.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[25.00%] ··· Running (dwt_benchmarks.Dwt2TimeSuite.time_dwt2--).
[25.00%] · For pywt commit bb4e0e3d <bench-limited-api> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[25.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[50.00%] ··· Running (dwt_benchmarks.Dwt2TimeSuite.time_dwt2--).
[50.00%] · For pywt commit bb4e0e3d <bench-limited-api> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[75.00%] ··· dwt_benchmarks.Dwt2TimeSuite.time_dwt2                                                                                                                                                                                                         ok
[75.00%] ··· ====== ============ ============
             --              wavelet         
             ------ -------------------------
               n        haar         db4     
             ====== ============ ============
               16     54.8±1μs     62.9±2μs  
               64     88.6±1μs     114±3μs   
              100    138±0.7μs     179±1μs   
              128     224±1μs      288±1μs   
              192     425±2μs      539±6μs   
              256    1.03±0.2ms   1.44±0.2ms 
              1024    22.6±2ms     24.6±1ms  
              4096    779±4ms      810±1ms   
             ====== ============ ============

[75.00%] · For pywt commit b8ddd4d8 <main> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[75.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[100.00%] ··· dwt_benchmarks.Dwt2TimeSuite.time_dwt2                                                                                                                                                                                                         ok
[100.00%] ··· ====== ============ ============
              --              wavelet         
              ------ -------------------------
                n        haar         db4     
              ====== ============ ============
                16    35.7±0.2μs   41.2±0.1μs 
                64     68.0±2μs     93.5±1μs  
               100    112±0.3μs    162±0.7μs  
               128     195±1μs      268±1μs   
               192    376±0.8μs     521±4μs   
               256    1.15±0.2ms   1.41±0.2ms 
               1024   21.7±0.7ms    25.0±1ms  
               4096    777±10ms     819±7ms   
              ====== ============ ============

| Change   | Before [b8ddd4d8] <main>   | After [bb4e0e3d] <bench-limited-api>   |   Ratio | Benchmark (Parameter)                               |
|----------|----------------------------|----------------------------------------|---------|-----------------------------------------------------|
| +        | 35.7±0.2μs                 | 54.8±1μs                               |    1.54 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(16, 'haar')  |
| +        | 41.2±0.1μs                 | 62.9±2μs                               |    1.53 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(16, 'db4')   |
| +        | 68.0±2μs                   | 88.6±1μs                               |    1.3  | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(64, 'haar')  |
| +        | 112±0.3μs                  | 138±0.7μs                              |    1.23 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(100, 'haar') |
| +        | 93.5±1μs                   | 114±3μs                                |    1.22 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(64, 'db4')   |
| +        | 195±1μs                    | 224±1μs                                |    1.15 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(128, 'haar') |
| +        | 376±0.8μs                  | 425±2μs                                |    1.13 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(192, 'haar') |
| +        | 162±0.7μs                  | 179±1μs                                |    1.1  | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(100, 'db4')  |
| +        | 268±1μs                    | 288±1μs                                |    1.08 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(128, 'db4')  |

The benchmarks are old and seem to be written for fast runtime rather than for realism; I wouldn't expect a size 16 or 16x16 transform to be useful. For images, I'd say 256x256 to 4096x4096 would be most relevant, maybe 128x128 too.
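
For readers who want to re-derive the Ratio column, the following is a minimal sketch that parses the timing strings printed by asv and applies the `--factor 1.05` threshold used in the command above. It only looks at the central values; asv's own comparison also takes the measurement spread into account, which this sketch ignores. The helper names are made up for illustration, and because the inputs are the rounded display values, the last digit can differ from what asv prints.

```python
import re

# Multipliers to seconds for the unit suffixes that appear in the tables.
_UNITS = {"ns": 1e-9, "μs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}

FACTOR = 1.05  # the --factor value passed to `asv continuous` above


def parse_timing(text: str) -> float:
    """Convert a string such as '35.7±0.2μs' to seconds (central value only)."""
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ns|μs|us|ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised timing: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def ratio(before: str, after: str) -> float:
    """Slowdown factor of `after` relative to `before` (>1 means slower)."""
    return parse_timing(after) / parse_timing(before)


# A few (before, after) pairs copied from the two result tables above.
rows = {
    "time_dwt2(16, 'haar')": ("35.7±0.2μs", "54.8±1μs"),
    "time_dwt2(128, 'db4')": ("268±1μs", "288±1μs"),
    "time_dwt2(4096, 'haar')": ("777±10ms", "779±4ms"),
}
for name, (before, after) in rows.items():
    r = ratio(before, after)
    flag = "+" if r > FACTOR else " "
    print(f"{flag} {r:5.2f}  {name}")
```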

rgommers · 2026-01-20T05:47:32Z

More worryingly, the new abi3 CI job showed a hang on macOS, after:

 tests/test_perfect_reconstruction.py .                                   [ 93%]

Since test_perfect_reconstruction.py completed, the hang should be in the first test in test_swt.py.

Error annotation:

The hosted runner lost communication with the server. Anything in your workflow that
terminates the runner process, starves it for CPU/Memory, or blocks its network access
can cause this error

rgommers · 2026-01-21T09:58:18Z

> the new abi3 CI job showed a hang on macOS

I haven't been able to reproduce that; however, it would be explained by the single issue that UBSan flagged (now fixed), see gh-836.

rgommers · 2026-01-21T15:49:45Z

The 1-D dwt benchmarks look slower. I wrote a new one that focuses on just the signal length, given that the difference between wavelets and modes is small:

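import numpy as np
import pywt
from pywt import Modes


class DwtSpanLengthsTimeSuite:
    # Vary only the signal length; one short (haar) and one longer (sym8) filter.
    params = ([16, 101, 256, 1024, 4096, 16384, 65536],
              ['haar', 'sym8'])
    param_names = ('n', 'wavelet')

    def setup(self, n, wavelet):
        # Allocate the input outside the timed region.
        self.data = np.ones(n, dtype=np.float64)

    def time_dwt(self, n, wavelet):
        pywt.dwt(self.data, wavelet, Modes.symmetric)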

Result:

$ pixi r bench --compare -t DwtSpanLengthsTimeSuite
✨ Pixi task (bench in default): spin bench --compare -t DwtSpanLengthsTimeSuite
$ cd benchmarks
$ asv continuous --factor 1.05 --bench DwtSpanLengthsTimeSuite c73b3aae7eec48c770ea72bf0017ffca086cf0c1 467212dd4bf39bbd964c6ce743e102f8b9d2ca6c
· `wheel_cache_size` has been renamed to `build_cache_size`. Update your `asv.conf.json` accordingly.
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.12-Cython-meson-python-numpy
·· Installing 467212dd <use-limited-api> into virtualenv-py3.12-Cython-meson-python-numpy..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pywt commit c73b3aae <main> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[25.00%] ··· Running (dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt--).
[25.00%] · For pywt commit 467212dd <use-limited-api> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[50.00%] ··· Running (dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt--).
[50.00%] · For pywt commit 467212dd <use-limited-api> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[75.00%] ··· dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt                                                                                                                                                                       ok
[75.00%] ··· ======= ============ =============
             --               wavelet          
             ------- --------------------------
                n        haar          sym8    
             ======= ============ =============
                16    4.35±0.1μs   4.80±0.06μs 
               101    4.64±0.2μs    5.48±0.1μs 
               256    5.24±0.2μs    7.09±0.2μs 
               1024   7.56±0.2μs    13.3±0.1μs 
               4096   15.6±0.2μs    37.9±0.4μs 
              16384   48.2±0.6μs    133±0.6μs  
              65536    188±3μs       537±5μs   
             ======= ============ =============

[75.00%] · For pywt commit c73b3aae <main> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[100.00%] ··· dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt                                                                                                                                                                       ok
[100.00%] ··· ======= ============= =============
              --                wavelet          
              ------- ---------------------------
                 n         haar          sym8    
              ======= ============= =============
                 16    3.33±0.02μs    3.76±0.1μs 
                101    3.58±0.04μs   4.56±0.04μs 
                256    4.15±0.01μs   5.99±0.05μs 
                1024   6.30±0.04μs   12.0±0.07μs 
                4096   13.5±0.05μs   36.4±0.08μs 
               16384    42.6±0.5μs    132±0.8μs  
               65536     171±4μs       531±7μs   
              ======= ============= =============

| Change   | Before [c73b3aae] <main>   | After [467212dd] <use-limited-api>   |   Ratio | Benchmark (Parameter)                                          |
|----------|----------------------------|--------------------------------------|---------|----------------------------------------------------------------|
| +        | 3.33±0.02μs                | 4.35±0.1μs                           |    1.31 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16, 'haar')    |
| +        | 3.58±0.04μs                | 4.64±0.2μs                           |    1.3  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(101, 'haar')   |
| +        | 3.76±0.1μs                 | 4.80±0.06μs                          |    1.28 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16, 'sym8')    |
| +        | 4.15±0.01μs                | 5.24±0.2μs                           |    1.26 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(256, 'haar')   |
| +        | 4.56±0.04μs                | 5.48±0.1μs                           |    1.2  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(101, 'sym8')   |
| +        | 6.30±0.04μs                | 7.56±0.2μs                           |    1.2  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(1024, 'haar')  |
| +        | 5.99±0.05μs                | 7.09±0.2μs                           |    1.18 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(256, 'sym8')   |
| +        | 13.5±0.05μs                | 15.6±0.2μs                           |    1.16 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(4096, 'haar')  |
| +        | 42.6±0.5μs                 | 48.2±0.6μs                           |    1.13 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16384, 'haar') |
| +        | 12.0±0.07μs                | 13.3±0.1μs                           |    1.11 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(1024, 'sym8')  |
| +        | 171±4μs                    | 188±3μs                              |    1.1  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(65536, 'haar') |

The conclusion is similar though: really small data sees larger regressions, while for calls that take about 500 μs or more it doesn't matter much. 1-D transforms are simply faster per call, so the same added overhead shows up as a larger relative slowdown.
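
The point about relative versus absolute cost can be checked with a short calculation. This is a minimal sketch using the `haar` column of the two 1-D tables above (central values only, uncertainties ignored); the function name is made up for illustration.

```python
# (n, before_us, after_us) for the 'haar' column of the 1-D tables above.
HAAR_1D = [
    (16, 3.33, 4.35),
    (101, 3.58, 4.64),
    (256, 4.15, 5.24),
    (1024, 6.30, 7.56),
    (4096, 13.5, 15.6),
    (16384, 42.6, 48.2),
    (65536, 171.0, 188.0),
]


def overhead(before_us: float, after_us: float) -> tuple[float, float]:
    """Return (absolute overhead in μs, relative overhead in percent)."""
    absolute = after_us - before_us
    relative = 100.0 * absolute / before_us
    return absolute, relative


for n, before, after in HAAR_1D:
    absolute, relative = overhead(before, after)
    print(f"n={n:>6}: +{absolute:5.2f} μs  (+{relative:4.1f}%)")
```

For these numbers the absolute difference grows from about 1 μs at n=16 to about 17 μs at n=65536, while the relative slowdown falls from roughly 31% to roughly 10%.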

We may just want to keep this for testing, or decide that the reduction in wheel builds is worth it and do PyPI releases with abi3. My own motivation is also to test abi3, given that support in Cython is relatively new and it's a lot easier to test with this package than with a larger one. So wider usage has value, and the change can be reverted if the slowdown turns out to matter in the real world.

@grlee77 do you have any thoughts here on how much performance matters for small data and operations that are <1 ms?

rgommers added enhancement build Official binaries labels Oct 19, 2025

rgommers mentioned this pull request Oct 19, 2025

CI Generate abi3 wheels scikit-learn/scikit-learn#32532

Draft

2 tasks

rgommers force-pushed the use-limited-api branch from a025157 to ddbcd62 January 19, 2026 21:45

rgommers mentioned this pull request Jan 20, 2026

CI/MAINT: add a CI job with ASan and UBSan on macOS, fix one UB issue #836

Merged

rgommers added 4 commits January 20, 2026 13:27

BLD: use the CPython Limited C API, and hence the Stable ABI

37a7cf9

This will build an `abi3` wheel per platform, which can be used for multiple (with-GIL) Python interpreter versions.

CI: update release job for Stable ABI usage

3dedabd

CI: add Stable ABI jobs, build under py312, test under py314

6c4fb09

DEBUG: default to stable ABI for testing purposes

467212d

rgommers force-pushed the use-limited-api branch from ddbcd62 to 467212d January 20, 2026 12:30
