Conversation
Looking much better
So, we are using Array of Structures for …
We might be missing …
About the Node design: we absolutely cannot change …
Also, since …
If it's a lot more efficient, hopefully we won't need microbatching.
Good catch, implemented (though with a different scheme) in a new commit; decent speedup recorded (see the PR comment).
It is optimizable, sure, but that's a problem only for the first batch of the first epoch. Furthermore, we don't really know exactly how much space we need on the tape, so I would leave it as is. I added a bit of reserve to help a little anyway.
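(Not part of the PR, just a minimal sketch of what "a bit of reserve" on the tape could look like; the `Tape`/`TapeEntry` names and the slack amount are hypothetical, not the actual implementation.)

```cpp
#include <cstddef>
#include <vector>

// Hypothetical gradient-tape entry: the exact number of entries per batch is
// not known up front, so the buffer grows on the first batch of the first
// epoch and a small extra reserve avoids further reallocations afterwards.
struct TapeEntry {
    int   op;     // opcode of the recorded operation
    float value;  // cached forward value
    float grad;   // accumulated gradient
};

class Tape {
public:
    // Reserve an initial guess plus some slack; only the very first batch may
    // still trigger a reallocation.
    explicit Tape(std::size_t expected_entries) {
        entries_.reserve(expected_entries + expected_entries / 8);  // ~12% slack (assumed)
    }

    void record(const TapeEntry& e) { entries_.push_back(e); }
    void clear() { entries_.clear(); }  // keeps capacity for the next batch

private:
    std::vector<TapeEntry> entries_;
};
```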
Absolutely. It also needs a redesign to support generic-sized input operations.
It doesn't?
Yes.
So much better 😄
JonathanHallstrom left a comment:
Looks good, just a few small things
Bench: 11856625
This rewrite of the tuning system brings a huge speedup by:
It's still probably very optimizable, and during the rewrite I removed two things that will definitely need to be reimplemented:
Hopefully the rewrite also catches a stray bug somewhere along the way.
The Node design probably needs to be redone for better cache performance and alignment.
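(For illustration only: a hedged sketch of the direction this hints at, combining the Array-of-Structures layout mentioned above with the alignas(16) step from the progression below; the field set and the `Node`/`NodeTape` names are assumptions, not the actual design.)

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node stored in an Array-of-Structures tape. alignas(16) keeps
// every node on a 16-byte boundary so a node never straddles more cache lines
// than necessary, and the four 4-byte fields keep sizeof(Node) at exactly 16.
struct alignas(16) Node {
    float         value;  // forward value
    float         grad;   // backward gradient
    std::uint32_t lhs;    // index of first parent on the tape
    std::uint32_t rhs;    // index of second parent (or sentinel for unary ops)
};

static_assert(sizeof(Node) == 16, "Node should stay one 16-byte slot");

// Array of Structures: one contiguous buffer of fixed-size nodes.
using NodeTape = std::vector<Node>;
```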
Feedback welcome and needed.
=================================
🚀 Performance Tracking
Machine: Ryzen 7 5800X
Dataset: v2.1 + v2.2 + v3 + dfrcv0 + dfrcv1
Metric: Average epoch runtime over 8 epochs
Baseline
Base: 83.5055 s/epoch
📈 Speedup Progression
- Node + alignas(16)
- std::unreachable
- 🏁 Current Best: 6.8646 s/epoch (12.16× faster than baseline)
Bench: 12044152
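(For reference, a hedged sketch of where the std::unreachable step from the progression above could apply, assuming a C++23 toolchain; the `Op` enum and `apply` function are made up for illustration, not taken from the PR.)

```cpp
#include <utility>  // std::unreachable (C++23)

enum class Op { Add, Mul, Relu };

// Telling the compiler the fall-through case cannot happen lets it drop the
// default branch from the switch dispatch in a hot forward/backward pass.
inline float apply(Op op, float a, float b) {
    switch (op) {
        case Op::Add:  return a + b;
        case Op::Mul:  return a * b;
        case Op::Relu: return a > 0.0f ? a : 0.0f;
    }
    std::unreachable();  // all enumerators handled above
}
```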