@rgommers
Member

The default for a local build is still to produce cp3xx wheels; when passing a build flag it is now possible to opt into producing wheels for the Stable ABI. This reduces the number of wheels to build from 5 to 3 per platform.
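For readers less familiar with the distinction, here is a minimal sketch of how the two kinds of wheels can be told apart by filename. It relies only on the standard wheel naming convention (`name-version[-build]-python-abi-platform.whl`), where a Stable ABI wheel carries `abi3` as its ABI tag. The helper names and the filenames in the usage note are hypothetical illustrations, not artifacts of this PR.

```python
# Sketch: distinguish a version-specific (cp3xx) wheel from a Stable ABI
# (abi3) wheel using only the filename. Helper names are hypothetical.

def wheel_tags(filename: str) -> tuple[str, str, str]:
    """Return (python_tag, abi_tag, platform_tag) parsed from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: 5 fields, or 6 with a build tag
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def is_stable_abi(filename: str) -> bool:
    """True if the wheel's ABI tag marks it as a Stable ABI (abi3) wheel."""
    return wheel_tags(filename)[1] == "abi3"
```

With made-up filenames, `is_stable_abi("pywavelets-1.0.0-cp312-abi3-macosx_11_0_arm64.whl")` is true, while the same name with `cp312-cp312` as the Python and ABI tags is not.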

TODO: benchmarking to ensure we don't lose a significant amount of performance.

@rgommers
Member Author

After fixing the benchmarks in gh-835, I got around to running them. For very small image sizes there is quite a lot of overhead, while for operations that take about 1 ms or more the difference seems very small. Example:

$ pixi r bench --compare -t Dwt2TimeSuite
✨ Pixi task (bench in default): spin bench --compare -t Dwt2TimeSuite
$ cd benchmarks
$ asv continuous --interleave-rounds --factor 1.05 --bench Dwt2TimeSuite b8ddd4d843ffcd82f8569e9988115d3a11d11120 bb4e0e3d59fb165266f68f277c443f73cdf944d5
· `wheel_cache_size` has been renamed to `build_cache_size`. Update your `asv.conf.json` accordingly.
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.14-Cython-meson-python-numpy
·· Installing bb4e0e3d <bench-limited-api> into virtualenv-py3.14-Cython-meson-python-numpy..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pywt commit b8ddd4d8 <main> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[ 0.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[25.00%] ··· Running (dwt_benchmarks.Dwt2TimeSuite.time_dwt2--).
[25.00%] · For pywt commit bb4e0e3d <bench-limited-api> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[25.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[50.00%] ··· Running (dwt_benchmarks.Dwt2TimeSuite.time_dwt2--).
[50.00%] · For pywt commit bb4e0e3d <bench-limited-api> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[75.00%] ··· dwt_benchmarks.Dwt2TimeSuite.time_dwt2                                                                                                                                                                                                         ok
[75.00%] ··· ====== ============ ============
             --              wavelet         
             ------ -------------------------
               n        haar         db4     
             ====== ============ ============
               16     54.8±1μs     62.9±2μs  
               64     88.6±1μs     114±3μs   
              100    138±0.7μs     179±1μs   
              128     224±1μs      288±1μs   
              192     425±2μs      539±6μs   
              256    1.03±0.2ms   1.44±0.2ms 
              1024    22.6±2ms     24.6±1ms  
              4096    779±4ms      810±1ms   
             ====== ============ ============

[75.00%] · For pywt commit b8ddd4d8 <main> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.14-Cython-meson-python-numpy..
[75.00%] ·· Benchmarking virtualenv-py3.14-Cython-meson-python-numpy
[100.00%] ··· dwt_benchmarks.Dwt2TimeSuite.time_dwt2                                                                                                                                                                                                         ok
[100.00%] ··· ====== ============ ============
              --              wavelet         
              ------ -------------------------
                n        haar         db4     
              ====== ============ ============
                16    35.7±0.2μs   41.2±0.1μs 
                64     68.0±2μs     93.5±1μs  
               100    112±0.3μs    162±0.7μs  
               128     195±1μs      268±1μs   
               192    376±0.8μs     521±4μs   
               256    1.15±0.2ms   1.41±0.2ms 
               1024   21.7±0.7ms    25.0±1ms  
               4096    777±10ms     819±7ms   
              ====== ============ ============

| Change   | Before [b8ddd4d8] <main>   | After [bb4e0e3d] <bench-limited-api>   |   Ratio | Benchmark (Parameter)                               |
|----------|----------------------------|----------------------------------------|---------|-----------------------------------------------------|
| +        | 35.7±0.2μs                 | 54.8±1μs                               |    1.54 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(16, 'haar')  |
| +        | 41.2±0.1μs                 | 62.9±2μs                               |    1.53 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(16, 'db4')   |
| +        | 68.0±2μs                   | 88.6±1μs                               |    1.3  | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(64, 'haar')  |
| +        | 112±0.3μs                  | 138±0.7μs                              |    1.23 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(100, 'haar') |
| +        | 93.5±1μs                   | 114±3μs                                |    1.22 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(64, 'db4')   |
| +        | 195±1μs                    | 224±1μs                                |    1.15 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(128, 'haar') |
| +        | 376±0.8μs                  | 425±2μs                                |    1.13 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(192, 'haar') |
| +        | 162±0.7μs                  | 179±1μs                                |    1.1  | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(100, 'db4')  |
| +        | 268±1μs                    | 288±1μs                                |    1.08 | dwt_benchmarks.Dwt2TimeSuite.time_dwt2(128, 'db4')  |

The benchmarks are old and seem to be written for fast runtime rather than for realism - I wouldn't expect a size 16 or 16x16 transform to be useful. For images, I'd say 256x256 to 4096x4096 would be most relevant, maybe 128x128 too.
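As a rough cross-check of the numbers above, the sketch below recomputes the absolute and relative differences for the `haar` column from the displayed means of the two tables. The ± spreads are ignored, and because the inputs are the rounded values shown in the tables, a ratio can differ in the last digit from what asv reports. The variable and function names are my own.

```python
# Mean timings in microseconds, copied from the haar column of the two
# asv tables above (ms values converted to μs; ± spreads ignored).
main_us = {16: 35.7, 64: 68.0, 100: 112, 128: 195, 192: 376,
           256: 1150, 1024: 21700, 4096: 777000}
limited_api_us = {16: 54.8, 64: 88.6, 100: 138, 128: 224, 192: 425,
                  256: 1030, 1024: 22600, 4096: 779000}


def compare(before, after):
    """Return (n, absolute difference in μs, after/before ratio) per size."""
    return [(n, after[n] - before[n], after[n] / before[n])
            for n in sorted(before)]


for n, diff_us, ratio in compare(main_us, limited_api_us):
    print(f"{n:>5}  {diff_us:>+10.1f} μs  {ratio:.2f}x")
```

For the sizes up to 192 the absolute gap is a few tens of microseconds, which is what shows up as a 1.13x to 1.54x ratio. From 256 upward the differences are within the reported spreads (the 256 row even comes out below 1.0), which is consistent with those rows not appearing in the summary table.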

@rgommers
Member Author

rgommers commented Jan 20, 2026

More worryingly, the new abi3 CI job showed a hang on macOS, after:

 tests/test_perfect_reconstruction.py .                                   [ 93%]

That means the hang should be in the first test in test_swt.py.

Error annotation:

The hosted runner lost communication with the server. Anything in your workflow that
terminates the runner process, starves it for CPU/Memory, or blocks its network access
can cause this error

@rgommers
Member Author

> the new abi3 CI job showed a hang on macOS

I haven't been able to reproduce that. However, it would be explained by the single issue that UBSan flagged (now fixed); see gh-836.

@rgommers
Member Author

The 1-D dwt benchmarks look slower. I wrote a new one that focuses on just the length, given that the difference between wavelets and Modes is small:

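import numpy as np
import pywt
from pywt import Modes


class DwtSpanLengthsTimeSuite:
    # asv runs time_dwt once per (n, wavelet) combination
    params = ([16, 101, 256, 1024, 4096, 16384, 65536],
              ['haar', 'sym8'])
    param_names = ('n', 'wavelet')

    def setup(self, n, wavelet):
        # Allocate the input outside the timed region
        self.data = np.ones(n, dtype=np.float64)

    def time_dwt(self, n, wavelet):
        # Single-level 1-D transform with a fixed extension mode
        pywt.dwt(self.data, wavelet, Modes.symmetric)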

Result:

$ pixi r bench --compare -t DwtSpanLengthsTimeSuite
✨ Pixi task (bench in default): spin bench --compare -t DwtSpanLengthsTimeSuite
$ cd benchmarks
$ asv continuous --factor 1.05 --bench DwtSpanLengthsTimeSuite c73b3aae7eec48c770ea72bf0017ffca086cf0c1 467212dd4bf39bbd964c6ce743e102f8b9d2ca6c
· `wheel_cache_size` has been renamed to `build_cache_size`. Update your `asv.conf.json` accordingly.
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.12-Cython-meson-python-numpy
·· Installing 467212dd <use-limited-api> into virtualenv-py3.12-Cython-meson-python-numpy..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pywt commit c73b3aae <main> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[25.00%] ··· Running (dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt--).
[25.00%] · For pywt commit 467212dd <use-limited-api> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[50.00%] ··· Running (dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt--).
[50.00%] · For pywt commit 467212dd <use-limited-api> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[75.00%] ··· dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt                                                                                                                                                                       ok
[75.00%] ··· ======= ============ =============
             --               wavelet          
             ------- --------------------------
                n        haar          sym8    
             ======= ============ =============
                16    4.35±0.1μs   4.80±0.06μs 
               101    4.64±0.2μs    5.48±0.1μs 
               256    5.24±0.2μs    7.09±0.2μs 
               1024   7.56±0.2μs    13.3±0.1μs 
               4096   15.6±0.2μs    37.9±0.4μs 
              16384   48.2±0.6μs    133±0.6μs  
              65536    188±3μs       537±5μs   
             ======= ============ =============

[75.00%] · For pywt commit c73b3aae <main> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-meson-python-numpy..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-meson-python-numpy
[100.00%] ··· dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt                                                                                                                                                                       ok
[100.00%] ··· ======= ============= =============
              --                wavelet          
              ------- ---------------------------
                 n         haar          sym8    
              ======= ============= =============
                 16    3.33±0.02μs    3.76±0.1μs 
                101    3.58±0.04μs   4.56±0.04μs 
                256    4.15±0.01μs   5.99±0.05μs 
                1024   6.30±0.04μs   12.0±0.07μs 
                4096   13.5±0.05μs   36.4±0.08μs 
               16384    42.6±0.5μs    132±0.8μs  
               65536     171±4μs       531±7μs   
              ======= ============= =============

| Change   | Before [c73b3aae] <main>   | After [467212dd] <use-limited-api>   |   Ratio | Benchmark (Parameter)                                          |
|----------|----------------------------|--------------------------------------|---------|----------------------------------------------------------------|
| +        | 3.33±0.02μs                | 4.35±0.1μs                           |    1.31 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16, 'haar')    |
| +        | 3.58±0.04μs                | 4.64±0.2μs                           |    1.3  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(101, 'haar')   |
| +        | 3.76±0.1μs                 | 4.80±0.06μs                          |    1.28 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16, 'sym8')    |
| +        | 4.15±0.01μs                | 5.24±0.2μs                           |    1.26 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(256, 'haar')   |
| +        | 4.56±0.04μs                | 5.48±0.1μs                           |    1.2  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(101, 'sym8')   |
| +        | 6.30±0.04μs                | 7.56±0.2μs                           |    1.2  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(1024, 'haar')  |
| +        | 5.99±0.05μs                | 7.09±0.2μs                           |    1.18 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(256, 'sym8')   |
| +        | 13.5±0.05μs                | 15.6±0.2μs                           |    1.16 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(4096, 'haar')  |
| +        | 42.6±0.5μs                 | 48.2±0.6μs                           |    1.13 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(16384, 'haar') |
| +        | 12.0±0.07μs                | 13.3±0.1μs                           |    1.11 | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(1024, 'sym8')  |
| +        | 171±4μs                    | 188±3μs                              |    1.1  | dwt_benchmarks.DwtSpanLengthsTimeSuite.time_dwt(65536, 'haar') |

The conclusion is similar though: really small data sees larger regressions, while from about 500 μs per call onward it doesn't matter much. 1-D transforms are simply faster, so the relative decrease in performance is larger.
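For anyone who wants to spot-check their own workload against both builds without setting up asv, a minimal sketch using only the standard library's `timeit` could look like this. The helper name `best_time_us` is hypothetical, and this is not how asv measures (no interleaved rounds, no statistics), so treat the numbers as indicative only.

```python
import timeit


def best_time_us(func, number=1000, repeat=5):
    """Best-of-`repeat` mean time per call of func(), in microseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition,
    # each covering `number` calls; the minimum is the least noisy estimate.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1e6
```

Usage would be something like `best_time_us(lambda: pywt.dwt(data, 'haar'))` with `data` set to a representative array, run once in an environment with a cp3xx build and once with an abi3 build.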

We may just want to keep this for testing, or decide that the reduction in wheel builds is worth it and do PyPI releases with abi3. My own motivation is also to test abi3, given support in Cython is relatively new and it's a lot easier to test with this package than with a larger one. So wider usage has value, and can also be reverted if it turns out to matter in the real world.

@grlee77 do you have any thoughts here on how much performance matters for small data and operations that are <1 ms?
