This repository explores the implementation and optimization of 2D convolution using CUDA. The project compares a sequential CPU implementation against two GPU variants: a standard global memory approach and an optimized version utilizing shared memory tiling and constant memory.
The following measurements were obtained on a Tesla T4 GPU:
| Implementation | Tile Size | Time Elapsed (ms) |
|---|---|---|
| CPU (Sequential) | N/A | 61.912 |
| GPU (Without Tiling) | N/A | 0.283776 |
| GPU (With Tiling) | 4 | 0.473152 |
| GPU (With Tiling) | 8 | 0.103808 |
| GPU (With Tiling) | 16 | 0.096224 |
| GPU (With Tiling) | 32 | 0.102304 |
The convolution mask is stored in the `__constant__` memory space. This choice was made because the mask is read-only and accessed by all threads, allowing for efficient broadcasting and reduced global memory accesses.
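A minimal sketch of this idea is shown below; the mask size, symbol name, and helper name are illustrative and may not match the repository's sources.

```cuda
// Sketch only: the mask lives in constant memory, so all threads in a warp
// reading the same element receive it via a hardware broadcast.
#define MASK_WIDTH 5  // assumed mask width, for illustration

__constant__ float d_mask[MASK_WIDTH * MASK_WIDTH];

// Host side: copy the mask into the constant-memory symbol once,
// before launching any convolution kernel.
void upload_mask(const float *h_mask) {
    cudaMemcpyToSymbol(d_mask, h_mask, MASK_WIDTH * MASK_WIDTH * sizeof(float));
}
```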
To optimize performance, the input matrix is divided into tiles. Data is loaded into `__shared__` memory to allow threads within a block to reuse data, significantly reducing redundant reads from global memory.
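The following sketch shows one common form of that loading pattern, assuming a square `TILE_SIZE` x `TILE_SIZE` thread block and reusing the illustrative `MASK_WIDTH`/`d_mask` names from the sketch above; the actual kernel in `convolution_2D_tiling.cu` may differ in its details.

```cuda
#define TILE_SIZE 16
#define MASK_RADIUS (MASK_WIDTH / 2)

__global__ void convolution_2D_tiled(const float *in, float *out,
                                     int width, int height) {
    // Shared tile padded by the mask radius on every side (the "halo").
    __shared__ float tile[TILE_SIZE + 2 * MASK_RADIUS][TILE_SIZE + 2 * MASK_RADIUS];

    int out_row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int out_col = blockIdx.x * TILE_SIZE + threadIdx.x;

    // Cooperatively load the tile plus halo; out-of-bounds reads become 0.0f.
    for (int r = threadIdx.y; r < TILE_SIZE + 2 * MASK_RADIUS; r += TILE_SIZE) {
        for (int c = threadIdx.x; c < TILE_SIZE + 2 * MASK_RADIUS; c += TILE_SIZE) {
            int in_row = blockIdx.y * TILE_SIZE + r - MASK_RADIUS;
            int in_col = blockIdx.x * TILE_SIZE + c - MASK_RADIUS;
            tile[r][c] = (in_row >= 0 && in_row < height &&
                          in_col >= 0 && in_col < width)
                             ? in[in_row * width + in_col]
                             : 0.0f;
        }
    }
    __syncthreads();

    // Accumulate the weighted neighborhood entirely from shared memory.
    if (out_row < height && out_col < width) {
        float acc = 0.0f;
        for (int i = 0; i < MASK_WIDTH; ++i)
            for (int j = 0; j < MASK_WIDTH; ++j)
                acc += d_mask[i * MASK_WIDTH + j] *
                       tile[threadIdx.y + i][threadIdx.x + j];
        out[out_row * width + out_col] = acc;
    }
}
```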
A dedicated `get_element` device function handles boundary conditions by returning `0.0f` for out-of-bounds accesses, ensuring consistency across the different kernel implementations.
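The exact signature is not reproduced here, so the helper below is only a plausible shape of that idea (in the sketch above the same check is written inline).

```cuda
// Plausible form of the boundary helper: coordinates outside the matrix
// simply contribute 0.0f to the convolution sum.
__device__ float get_element(const float *matrix, int row, int col,
                             int width, int height) {
    if (row < 0 || row >= height || col < 0 || col >= width)
        return 0.0f;
    return matrix[row * width + col];
}
```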
- Streaming Multiprocessors (SMs): 40
- Peak Memory Bandwidth: 320.064 GB/s
- Max Threads Per Block: 1024
- Shared Memory Per Block: 49152 bytes
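Figures like these can be queried at runtime through the CUDA runtime API; a minimal sketch (device 0 assumed) is shown below. The bandwidth line derives the peak from the memory clock and bus width reported by the driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    // Peak bandwidth (GB/s) = 2 * memory clock (kHz) * bus width (bytes) / 1e6
    printf("Peak memory bandwidth: %.3f GB/s\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    return 0;
}
```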
- Open `Convolution2D_Tiling.ipynb` in Google Colab.
- Ensure the Runtime Type is set to GPU.
- Execute the cells to install the `nvcc4jupyter` plugin and run the benchmarks (see the sketch after this list).
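A typical Colab cell for the installation step might look like the following; the cell-magic name has changed across `nvcc4jupyter` releases, so follow the notebook's own cells if they differ.

```
# Install and load the plugin inside the Colab runtime
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
```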
Navigate to the `/src` directory and use the NVIDIA CUDA Compiler (`nvcc`).

To compile and run the tiled implementation:

```bash
nvcc -diag-suppress 20044 main_2D_convolution_tiling.cu convolution_2D_tiling.cu -o convolution_tiled
./convolution_tiled
```

To compile and run the non-tiled implementation:

```bash
nvcc -diag-suppress 20044 main_2D_convolution.cu convolution_2D.cu convolution_2D_cpu.cpp -o convolution_standard
./convolution_standard
```