Optimize op_conv_vef_face kernel #12
Thanks Paul, I will have a look on Monday.
For the Op_Conv_VEF_Face kernel, we see a speedup of between 18% and 34% on an Nvidia A6000 across our GPU test cases. Strangely, it seems slower on an A100... I merged your code into a local branch here because the pattern is very interesting, and because, thanks to your work, we now know that a local static array does not use registers but global memory, which is replaced here by faster scratch memory. For testing, the code can switch between the two implementations (with and without scratch memory) through a TRUST_USE_SCRATCH_MEMORY environment variable.
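A switch like this can be read once at startup. The sketch below is only an illustration: the helper name `use_scratch_memory` and the convention that an unset, empty, or "0" value means "off" are assumptions, not taken from the branch.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the PR): decides which kernel variant to run.
// Assumed convention: unset, empty, or "0" disables scratch memory;
// any other value enables it.
bool use_scratch_memory()
{
  const char* value = std::getenv("TRUST_USE_SCRATCH_MEMORY");
  if (value == nullptr)
    return false;
  const std::string s(value);
  return !s.empty() && s != "0";
}
```

The caller would then branch once on the result and launch either the scratch-memory variant or the original one, so both can be timed from the same binary.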
I am adding Adrien and Rémi to discuss the benefit/complexity ratio introduced by using scratch memory. To give an idea, a 30% speedup is the probable gain from using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU-specific? Is it the same on AMD? And what about future GPU cards in 10 years? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value. According to Hari's tests, scratch memory through hierarchical memory is not worthwhile on other kernels (such as the diffusion one).
On the AMD MI250X, the slowdown with scratch memory is between 7% and 25% (warp size 64?).
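On the warp size question (32 here, possibly 64 on the MI250X), one option is to pass the team size as a parameter rather than a literal, so the scratch request follows it. The sketch below assumes each thread of a team owns a fixed number of `double` temporaries. That is an assumption about the layout, and `team_scratch_bytes` is a hypothetical helper, not code from the PR. It also ignores any alignment padding a real allocator may add.

```cpp
#include <cstddef>

// Hypothetical helper (not from the PR): bytes of per-team scratch needed
// when every thread in the team holds `doubles_per_thread` temporaries.
// Alignment padding is deliberately ignored in this sketch.
constexpr std::size_t team_scratch_bytes(std::size_t team_size,
                                         std::size_t doubles_per_thread)
{
  return team_size * doubles_per_thread * sizeof(double);
}
```

With this shape, moving from a team of 32 threads to one of 64 doubles the per-team request, which is one reason a hard-coded 32 may not transfer between architectures. In Kokkos the team size can also be left to the library with `Kokkos::AUTO`, which is not shown here.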
Remove optimization of scratch memory size for order 3.
This PR aims to optimize the large convective kernel in src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp. It replaces temporary arrays in local memory with Kokkos views in scratch memory.