[ISSUE-815] Generate Random Numbers Asynchronously on the GPU #859
AndrewBMadison wants to merge 41 commits into SharedDevelopment
Conversation
…g to #include <cuda_runtime.h>
I have implemented all the changes you requested and renamed the stream used by all the synchronous kernels to simulationStream (simulationStream_ as a member variable). I have also added new documentation to the developer docs and linked it into index.md. The old MersenneTwister files are still there in case anyone wants to try them out again, but I can remove them if you'd like.
stiber
left a comment
Looks great; I will merge this.
I take it back; this needs to have SharedDevelopment merged into it. There may be a conflict with the changes to device memory allocation/deallocation being moved to
stiber
left a comment
Besides the comments below, we need to examine GPUModel::allocEdgeIndexMap() and GPUModel::copyCPUtoGPU() to see if they need rewrites because of DeviceVector.
…g the diff and reviewing changes to resolve conflicts.
I reverted and remerged SharedDevelopment into AndrewDevelopment, manually reviewing the changes and deleting old, unnecessary code from the SharedDevelopment commit before Ben merged his code in. This resolved a lot of the changes you requested, but I am unsure whether the OperationManager should execute the copyCPUtoGPU, so maybe you could ask Ben. I also moved the AsyncGenerator deletion as you requested.
Closes #815
Description
Replaced the custom Mersenne Twister GPU kernel with an AsyncPhilox_d class that asynchronously fills GPU buffers with random noise using cuRAND's Philox generator. The class supports double-buffering and is designed for concurrent execution.
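As a rough illustration of the approach (not the actual AsyncPhilox_d code; kernel and parameter names here are hypothetical), a Philox-based fill might use the cuRAND device API like this:

```cuda
#include <curand_kernel.h>

// One Philox state per generator thread; each thread gets its own
// subsequence so the random streams are independent.
__global__ void initPhiloxStates(curandStatePhilox4_32_10_t *states,
                                 unsigned long long seed, int numStates)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < numStates)
      curand_init(seed, idx, 0, &states[idx]);
}

// Grid-stride fill: each thread draws noise values and writes them
// into the target buffer. Assumes the grid covers all state slots.
__global__ void fillNoiseBuffer(curandStatePhilox4_32_10_t *states,
                                float *buffer, int bufferSize)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   curandStatePhilox4_32_10_t local = states[idx];   // copy state to registers
   for (int i = idx; i < bufferSize; i += gridDim.x * blockDim.x)
      buffer[i] = curand_normal(&local);
   states[idx] = local;                              // persist advanced state
}
```

Philox is counter-based, which is what makes it cheap to initialize and well suited to this kind of bulk asynchronous generation.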
GPUModel initializes Philox states and fills two initial buffers via loadPhilox() on a member AsyncPhilox_d instance. During each advance() call, requestSegment() retrieves a float* slice from the currently active buffer, sized appropriately for each vertex and ready to be used in advanceVertices().
Once a buffer is consumed, fillBuffer() is triggered on the other buffer while the current one continues to serve slices. This ensures continuous data availability through double-buffering.
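The buffer-swap logic described above could be sketched as follows. requestSegment() and fillBuffer() are the names from this PR, but the member variables and internal structure shown are assumptions, not the actual implementation:

```cuda
// Hypothetical sketch: when the active buffer cannot satisfy a
// request, switch to the other buffer and asynchronously refill the
// exhausted one on the generator's private stream. A real
// implementation must also ensure the buffer being switched to has
// finished refilling (e.g. via an event or stream synchronization).
float *AsyncPhilox_d::requestSegment(int count)
{
   if (offset_ + count > bufferSize_) {
      int spent = active_;
      active_ = 1 - active_;   // start serving from the other buffer
      offset_ = 0;
      fillBuffer(spent);       // async refill of the consumed buffer
   }
   float *segment = buffers_[active_] + offset_;
   offset_ += count;
   return segment;
}
```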
AsyncPhilox_d uses its own internal CUDA stream to launch fill kernels asynchronously. To enable true concurrency, all other compute kernels also needed to move to non-default streams: stream 0 (the default stream) implicitly synchronizes with all other streams, so any work launched on it would serialize kernel execution even when kernels could otherwise run in parallel.
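A minimal sketch of launching compute work on a dedicated non-default stream (the kernel name and launch configuration are illustrative):

```cuda
#include <cuda_runtime.h>

cudaStream_t simulationStream;
cudaStreamCreate(&simulationStream);

// The fourth launch parameter selects the stream. With nothing on the
// default stream 0, fill kernels on the generator's stream and compute
// kernels on simulationStream may execute concurrently.
advanceVerticesKernel<<<numBlocks, threadsPerBlock, 0, simulationStream>>>(/* ... */);

cudaStreamSynchronize(simulationStream);   // wait only for this stream
cudaStreamDestroy(simulationStream);
```

Creating streams with cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) is another option, as non-blocking streams do not synchronize with the legacy default stream at all.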
Checklist (Mandatory for new features)
Testing (Mandatory for all changes)
test-medium-connected.xml: Passed
test-large-long.xml: Passed