This repository explores the implementation and optimization of 2D convolution using CUDA. The project compares a sequential CPU implementation against two GPU variants: a standard global memory approach and an optimized version utilizing shared memory tiling and constant memory.
The following measurements were obtained on a Tesla T4 GPU:
| Implementation | Tile Size | Time Elapsed (ms) |
|---|---|---|
| CPU (Sequential) | N/A | 61.912 |
| GPU (Without Tiling) | N/A | 0.283776 |
| GPU (With Tiling) | 4 | 0.473152 |
| GPU (With Tiling) | 8 | 0.103808 |
| GPU (With Tiling) | 16 | 0.096224 |
| GPU (With Tiling) | 32 | 0.102304 |
The convolution mask is stored in the `__constant__` memory space. This choice was made because the mask is read-only and accessed by all threads, allowing for efficient broadcasting and reduced global memory accesses.
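A minimal sketch of this idea is shown below; the mask size, symbol name, and helper name are illustrative and may not match the repository's sources.

```cuda
// Sketch only: the mask lives in constant memory, so all threads in a warp
// reading the same element receive it via a hardware broadcast.
#define MASK_WIDTH 5  // assumed mask width, for illustration

__constant__ float d_mask[MASK_WIDTH * MASK_WIDTH];

// Host side: copy the mask into the constant-memory symbol once,
// before launching any convolution kernel.
void upload_mask(const float *h_mask) {
    cudaMemcpyToSymbol(d_mask, h_mask, MASK_WIDTH * MASK_WIDTH * sizeof(float));
}
```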
To optimize performance, the input matrix is divided into tiles. Data is loaded into `__shared__` memory to allow threads within a block to reuse data, significantly reducing redundant reads from global memory.
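The following sketch shows one common form of that loading pattern, assuming a square `TILE_SIZE` x `TILE_SIZE` thread block and reusing the illustrative `MASK_WIDTH`/`d_mask` names from the sketch above; the actual kernel in `convolution_2D_tiling.cu` may differ in its details.

```cuda
#define TILE_SIZE 16
#define MASK_RADIUS (MASK_WIDTH / 2)

__global__ void convolution_2D_tiled(const float *in, float *out,
                                     int width, int height) {
    // Shared tile padded by the mask radius on every side (the "halo").
    __shared__ float tile[TILE_SIZE + 2 * MASK_RADIUS][TILE_SIZE + 2 * MASK_RADIUS];

    int out_row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int out_col = blockIdx.x * TILE_SIZE + threadIdx.x;

    // Cooperatively load the tile plus halo; out-of-bounds reads become 0.0f.
    for (int r = threadIdx.y; r < TILE_SIZE + 2 * MASK_RADIUS; r += TILE_SIZE) {
        for (int c = threadIdx.x; c < TILE_SIZE + 2 * MASK_RADIUS; c += TILE_SIZE) {
            int in_row = blockIdx.y * TILE_SIZE + r - MASK_RADIUS;
            int in_col = blockIdx.x * TILE_SIZE + c - MASK_RADIUS;
            tile[r][c] = (in_row >= 0 && in_row < height &&
                          in_col >= 0 && in_col < width)
                             ? in[in_row * width + in_col]
                             : 0.0f;
        }
    }
    __syncthreads();

    // Accumulate the weighted neighborhood entirely from shared memory.
    if (out_row < height && out_col < width) {
        float acc = 0.0f;
        for (int i = 0; i < MASK_WIDTH; ++i)
            for (int j = 0; j < MASK_WIDTH; ++j)
                acc += d_mask[i * MASK_WIDTH + j] *
                       tile[threadIdx.y + i][threadIdx.x + j];
        out[out_row * width + out_col] = acc;
    }
}
```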
A dedicated `get_element` device function handles boundary conditions by returning `0.0f` for out-of-bounds accesses, ensuring consistency across the different kernel implementations.
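The exact signature is not reproduced here, so the helper below is only a plausible shape of that idea (in the sketch above the same check is written inline).

```cuda
// Plausible form of the boundary helper: coordinates outside the matrix
// simply contribute 0.0f to the convolution sum.
__device__ float get_element(const float *matrix, int row, int col,
                             int width, int height) {
    if (row < 0 || row >= height || col < 0 || col >= width)
        return 0.0f;
    return matrix[row * width + col];
}
```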
- Streaming Multiprocessors (SMs): 40
- Peak Memory Bandwidth: 320.064 GB/s
- Max Threads Per Block: 1024
- Shared Memory Per Block: 49152 bytes
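Figures like these can be queried at runtime through the CUDA runtime API; a minimal sketch (device 0 assumed) is shown below. The bandwidth line derives the peak from the memory clock and bus width reported by the driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    // Peak bandwidth (GB/s) = 2 * memory clock (kHz) * bus width (bytes) / 1e6
    printf("Peak memory bandwidth: %.3f GB/s\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    return 0;
}
```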
- Open `Convolution2D_Tiling.ipynb` in Google Colab.
- Ensure the Runtime Type is set to GPU.
- Execute the cells to install the `nvcc4jupyter` plugin and run the benchmarks (see the sketch after this list).
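A typical Colab cell for the installation step might look like the following; the cell-magic name has changed across `nvcc4jupyter` releases, so follow the notebook's own cells if they differ.

```
# Install and load the plugin inside the Colab runtime
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
```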
Navigate to the `/src` directory and use the NVIDIA CUDA Compiler (`nvcc`).

To compile and run the tiled implementation:

```bash
nvcc -diag-suppress 20044 main_2D_convolution_tiling.cu convolution_2D_tiling.cu -o convolution_tiled
./convolution_tiled
```

To compile and run the non-tiled implementation:

```bash
nvcc -diag-suppress 20044 main_2D_convolution.cu convolution_2D.cu convolution_2D_cpu.cpp -o convolution_standard
./convolution_standard
```