Added spill to shared and launch bounds #16
gimmik/kernels/cuda/bstream.mako
Outdated
const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( ( defined(__CUDACC_VER_MAJOR__) && ( __CUDACC_VER_MAJOR__ >= 13 ) ) || \
When would __CUDACC_VER_MAJOR__ not be defined?
No, I can't think of a case where those wouldn't be defined when compiling with NVIDIA tools. But some third-party toolchains, like SCALE, claim to compile CUDA for other accelerators, and I don't know what they define. So I thought it was good practice to check that the macros exist first.
I think we can just check directly. Also, do we need to care about CUDA 12? It seems easier to just require 13 or later.
OK, changed this to check for CUDA 13 only.
const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( __CUDACC_VER_MAJOR__ >= 13 )
Does it have to go at the start of a function or can we move it down after the variable declarations so that it only needs to appear once?
No, it needs to come first.
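For readers unfamiliar with the feature, the placement being discussed can be sketched roughly as follows. This assumes the opt-in is the PTX pragma `enable_smem_spilling` issued through inline assembly, which is how NVIDIA describes shared-memory register spilling for CUDA 13; the pragma line itself is not visible in the hunks above. The kernel name `axpy_sketch` and its body are hypothetical and are not taken from the GiMMiK templates.

```cuda
// Hypothetical kernel; only the position of the guard mirrors the diff above.
__global__ void
axpy_sketch(int n, const double* __restrict__ x, double* __restrict__ y)
{
// An identifier that is not defined as a macro evaluates to 0 inside #if,
// so this comparison is simply false on toolchains that do not define
// __CUDACC_VER_MAJOR__ and the pragma is skipped.
#if ( __CUDACC_VER_MAJOR__ >= 13 )
    // Assumed opt-in for spilling registers to shared memory. Per the
    // reply above, it has to come before anything else in the body.
    asm volatile (".pragma \"enable_smem_spilling\";");
#endif

    int i = blockDim.x*blockIdx.x + threadIdx.x;

    if (i < n)
        y[i] += 2.0*x[i];
}
```

Because the guard has to sit at the top of each kernel body, a template with several kernels would repeat it once per kernel rather than sharing a single copy.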
Here is the FP64 performance improvement in % for N = 48^3.
Do you have absolute numbers, so we can see what fraction of peak FLOPs/bandwidth we achieve?
Added spilling to shared memory for the two kernels that don't already use shared memory. This feature requires CUDA >= 12.9.
Additionally, I added launch bounds to the CUDA kernels. This generally improves performance, and helps especially when spilling to shared memory.
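As a rough illustration of what adding launch bounds means, here is a minimal, self-contained sketch. The kernel `scale_sketch` and the block size `kBlockSize = 128` are hypothetical and are not the values used in this PR; only the `__launch_bounds__` qualifier itself is standard CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size, chosen for illustration only.
constexpr int kBlockSize = 128;

// __launch_bounds__ promises the compiler that this kernel is never launched
// with more than kBlockSize threads per block, which lets it budget
// registers for that block size.
__global__ void
__launch_bounds__(kBlockSize)
scale_sketch(int n, double a, double* __restrict__ x)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    if (i < n)
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    double* x = nullptr;

    if (cudaMalloc(&x, n*sizeof(double)) != cudaSuccess)
        return 1;
    cudaMemset(x, 0, n*sizeof(double));

    // The launch block size must not exceed the declared bound;
    // a larger block would make the launch fail.
    int nblocks = (n + kBlockSize - 1) / kBlockSize;
    scale_sketch<<<nblocks, kBlockSize>>>(n, 2.0, x);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(x);
    return err == cudaSuccess ? 0 : 1;
}
```

A plausible reason the bound matters more with spilling enabled is that the compiler needs a block size to decide how much shared memory to reserve per block for spilled registers, but that reasoning is an assumption here and is not stated in the PR.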