@rlplays rlplays commented Feb 2, 2026

Makes DEBUG=1 builds work and adds libtorch error catching in DEBUG mode.
When libtorch throws, the output looks like this:

Error from libtorch: The size of tensor a (8192) must match the size of tensor b (4096) at non-singleton dimension 0
Exception raised from infer_size_impl at /pytorch/aten/src/ATen/ExpandUtils.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x747230129fdd in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xc3 (0x7472300bf561 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #2: at::infer_size_dimvector(c10::ArrayRef<long>, c10::ArrayRef<long>) + 0x404 (0x74718fc9edd4 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIteratorBase::compute_shape(at::TensorIteratorConfig const&) + 0x110 (0x74718fd438b0 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x59 (0x74718fd48ce9 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x1e1e2b7 (0x74719016a2b7 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x57 (0x74719016c917 in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x692dddc (0x747194c79ddc in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x6931b8f (0x747194c7db8f in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x18a (0x747190fdaeaa in /home/peru/repo/thirdparty/puffer/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xd3ffe (0x747107beeffe in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x101097 (0x747107c1c097 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x10476b (0x747107c1f76b in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x17d012 (0x747107c98012 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #14: <unknown function> + 0x15e093 (0x747107c79093 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #15: <unknown function> + 0x13e98f (0x747107c5998f in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #16: std::function<void ()>::operator()() const + 0x36 (0x747107c2ec10 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #17: <unknown function> + 0x103668 (0x747107c1e668 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #18: <unknown function> + 0x10589a (0x747107c2089a in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #19: <unknown function> + 0x109023 (0x747107c24023 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #20: <unknown function> + 0x199dfb (0x747107cb4dfb in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #21: <unknown function> + 0x180921 (0x747107c9b921 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #22: <unknown function> + 0x163a79 (0x747107c7ea79 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #23: <unknown function> + 0x163da3 (0x747107c7eda3 in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
frame #24: <unknown function> + 0xcceae (0x747107be7eae in /home/peru/repo/thirdparty/PufferLib/pufferlib/_C.cpython-313-x86_64-linux-gnu.so)
<omitting python frames>
frame #42: <unknown function> + 0x2a1ca (0x747272c721ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #43: __libc_start_main + 0x8b (0x747272c7228b in /lib/x86_64-linux-gnu/libc.so.6)

Illegal instruction (core dumped)

This is particularly important for the individual threads whose exceptions won't be caught by the Python main thread catch-all.
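
The thread case can be sketched as follows. This is a minimal illustration, not the code in this PR: `run_guarded` is a hypothetical helper name, and it catches `std::exception` (which `c10::Error` derives from) so the snippet compiles without libtorch. An exception that escapes a `std::thread` entry function calls `std::terminate`, so the guard has to sit inside the thread body itself.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper (not from the PR): runs `body` and reports any
// exception instead of letting it escape the calling thread.
// Returns true on success, false if an exception was caught.
// If `message` is non-null, it receives the exception text.
bool run_guarded(const std::function<void()>& body,
                 std::string* message = nullptr) {
    assert(body && "body must be callable");
    try {
        body();
        return true;
    } catch (const std::exception& e) {
        // With libtorch available, a `catch (const c10::Error&)` clause
        // could go first; c10::Error is also caught here via its base class.
        std::cerr << "Error from libtorch: " << e.what() << std::endl;
        if (message != nullptr) {
            *message = e.what();
        }
        return false;
    }
}

// Example worker: the guard wraps the whole thread body, so a throw
// inside it is reported rather than reaching std::terminate.
bool run_worker_once(std::string* message) {
    bool ok = true;
    std::thread worker([&ok, message] {
        ok = run_guarded(
            [] { throw std::runtime_error("tensor size mismatch"); },
            message);
    });
    worker.join();
    return ok;
}
```

A DEBUG build could choose to abort after the message is printed; returning a flag here just keeps the sketch easy to exercise.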

To build the DEBUG version: DEBUG=1 python setup.py build_torch --force --inplace && DEBUG=1 python setup.py build_breakout --force --inplace

(TODO: NO_ASAN=1 must also be added back once ASAN is implemented.)

@rlplays rlplays changed the title 4.0 a Make DEBUG=1 builds work. Adds Torch error catching (to DEBUG mode). Feb 2, 2026
@rlplays rlplays changed the title Make DEBUG=1 builds work. Adds Torch error catching (to DEBUG mode). Make DEBUG=1 builds work + add DEBUG Torch error catching Feb 2, 2026