A compact CUDA C++ project implementing a 5×5 image blur with shared memory tiling, constant memory, and explicit halo handling, optimized for Windows + WSL2 workflows.
- CUDA 5×5 convolution (normalized box filter)
- Shared memory tile + halo for efficient stencil operations
- Constant memory for fast kernel coefficient broadcast
- CMake build, VS Code debug, CTest integration
- Status-check script for environment + runtime validation
- Impulse-response verification (~0.04 center value)
```
Shared Memory Tile (TILE + 2R = 20 x 20)

     0   1   2   3   ...            18  19  <-- threadIdx.x
   +---+---+---+---+---+---+---+---+---+---+
 0 | H | H | H | H | H | H | H | H | H | H |
   +---+---+---+---+---+---+---+---+---+---+
 1 | H | H | H | H | H | H | H | H | H | H |
   +---+---+---+---+---+---+---+---+---+---+
 2 | H | H | T | T | T | T | T | … | H | H |
   +---+---+---+---+---+---+---+---+---+---+
 3 | H | H | T | T | T | T | T | … | H | H |
   +---+---+---+---+---+---+---+---+---+---+
 .            (16×16 compute region)
 .
17 | H | H | T | T | T | T | T | … | H | H |
   +---+---+---+---+---+---+---+---+---+---+
18 | H | H | H | H | H | H | H | H | H | H |
   +---+---+---+---+---+---+---+---+---+---+
19 | H | H | H | H | H | H | H | H | H | H |
   +---+---+---+---+---+---+---+---+---+---+

Legend:
  H = Halo pixel (loaded but not producing output)
  T = Tile pixel (used for output)
```
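The tile dimensions above follow from two constants. As a minimal host-side sketch (assuming `TILE = 16` and `R = 2`, which is what the 20 x 20 figure and the 5×5 filter imply; the helper names are illustrative and not taken from the project), the arithmetic can be checked at compile time:

```cpp
#include <cassert>
#include <cstddef>

// Assumed values: a 16x16 output tile and radius 2 for a 5x5 filter.
constexpr int TILE = 16;
constexpr int R = 2;

// Side length of the shared tile, including the halo on both sides.
constexpr int tile_side(int tile, int r) { return tile + 2 * r; }

// Bytes of shared memory used by one float tile.
constexpr std::size_t tile_bytes(int tile, int r) {
    return static_cast<std::size_t>(tile_side(tile, r)) * tile_side(tile, r) * sizeof(float);
}

// Number of halo cells (loaded, but producing no output).
constexpr int halo_cells(int tile, int r) {
    return tile_side(tile, r) * tile_side(tile, r) - tile * tile;
}

static_assert(tile_side(TILE, R) == 20, "16 + 2*2 = 20");
static_assert(halo_cells(TILE, R) == 144, "400 - 256 = 144");
```

Under these assumptions each block stages 400 floats, of which 144 are halo cells and 256 produce output.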
```
 Global Memory                 Shared Memory                       Compute
  (Full Image)                 (Tile + Halo)                    (Output Tile)
+--------------+        +--------------------------+        +-------------------+
|              |        | H H H H H H H H H H ...  |        | O O O O O O O O   |
|    Image     | -----> | H T T T T T T T T H ...  | -----> | O O O O O O O O   |
|              |  load  | H T T T T T T T T H ...  |  K*K   | O O O O O O O O   |
+--------------+        | ... (20x20)              |        | (16x16 tile)      |
                        +--------------------------+        +-------------------+
       ↑                             ↑                                ↑
       |                             |                                |
       |                     cooperative load                   write results
       |                 (blockDim.x * blockDim.y)            to global memory
```
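The three stages above imply a launch geometry: one thread per shared-tile cell for the cooperative load, and one block per 16×16 output tile. A hedged host-side sketch of that arithmetic follows. The `Dim2` struct and both function names are hypothetical, and `TILE = 16`, `R = 2` are assumed from the diagram rather than read from the project's source:

```cpp
#include <cassert>

constexpr int TILE = 16;  // assumed output tile side
constexpr int R = 2;      // assumed filter radius (5x5 filter)

struct Dim2 { int x; int y; };  // hypothetical stand-in for CUDA's dim3

// Threads per block: one thread per shared-tile cell, halo included.
constexpr Dim2 block_dims() { return {TILE + 2 * R, TILE + 2 * R}; }

// Blocks needed so that TILE-wide output tiles cover a W x H image.
inline Dim2 grid_dims(int W, int H) {
    assert(W > 0 && H > 0);
    return {(W + TILE - 1) / TILE, (H + TILE - 1) / TILE};
}
```

For a 1000 × 17 image this gives a 63 × 2 grid. The last tile in each direction is then only partly inside the image, so a kernel launched this way would also need a bounds check before writing its output.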
```cpp
__constant__ float d_kernel[25];
```
Small, read-only kernel coefficients, broadcast to all threads of a warp that read the same address.
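For the normalized box filter, each of the 25 coefficients is 1/25. A minimal host-side sketch of preparing them is shown below. The function name is illustrative, and the upload step appears only as a comment (in CUDA it would typically be a `cudaMemcpyToSymbol` call targeting `d_kernel`), since the project's actual host code is not shown here:

```cpp
#include <array>
#include <cassert>

constexpr int K = 5;  // filter width, so K*K = 25 coefficients

// Build a normalized K x K box filter: all weights equal and summing to 1.
inline std::array<float, K * K> make_box_kernel() {
    std::array<float, K * K> h_kernel{};
    for (float& w : h_kernel) w = 1.0f / (K * K);
    return h_kernel;
}
// On the device side these values would then be copied into the __constant__
// array, e.g. cudaMemcpyToSymbol(d_kernel, h_kernel.data(), sizeof(float) * K * K);
```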
```cpp
__shared__ float tile[TILE + 2*R][TILE + 2*R];
```
Shared tile plus halo, so each loaded pixel is reused by neighbouring threads instead of being re-read from global memory.
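To make the tile-plus-halo layout concrete, here is a host-side sketch that emulates what one block would stage into `tile`: each of the 20 × 20 cells takes the image pixel at its global coordinate, with out-of-range coordinates clamped to the nearest border pixel. The function name, the row-major `std::vector` image, and the values `TILE = 16`, `R = 2` are assumptions for illustration; this is a CPU emulation, not the project's kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int TILE = 16;         // assumed output tile side
constexpr int R = 2;             // assumed filter radius
constexpr int S = TILE + 2 * R;  // shared tile side (20)

// Emulate the cooperative load for block (bx, by) on a W x H row-major image.
// (tx, ty) play the role of threadIdx.x / threadIdx.y.
inline std::vector<float> load_tile(const std::vector<float>& img,
                                    int W, int H, int bx, int by) {
    assert(static_cast<int>(img.size()) == W * H);
    std::vector<float> tile(S * S);
    for (int ty = 0; ty < S; ++ty) {
        for (int tx = 0; tx < S; ++tx) {
            int ox = bx * TILE + tx - R;               // may fall outside the image
            int oy = by * TILE + ty - R;
            int x = std::max(0, std::min(W - 1, ox));  // clamp to the border
            int y = std::max(0, std::min(H - 1, oy));
            tile[ty * S + tx] = img[y * W + x];
        }
    }
    return tile;
}
```

With clamping, halo cells that fall outside the image simply repeat the nearest edge pixel, so the blur needs no special case at the borders.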
Loop unrolling of the fixed-size filter loops:
- Removes loop overhead
- Allows instruction-level optimization
- Ideal for small fixed-size kernels such as 5×5
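These points apply because the filter size is known at compile time. A minimal sketch of such a fixed-trip-count accumulation is shown below, written as plain C++ so it can be tested on the host; in a CUDA kernel the same loops would typically carry `#pragma unroll`. The function name and the flat row-major `tile` and `kernel` layouts are assumptions for illustration.

```cpp
#include <cassert>

constexpr int TILE = 16;         // assumed output tile side
constexpr int R = 2;             // assumed filter radius
constexpr int K = 2 * R + 1;     // filter width (5)
constexpr int S = TILE + 2 * R;  // shared tile side (20)

// Weighted sum of the K x K neighbourhood centred on tile cell (tx, ty).
// Both loop bounds are compile-time constants, so the compiler is free to
// unroll them completely (in CUDA: add `#pragma unroll` before each loop).
inline float convolve_at(const float* tile, const float* kernel, int tx, int ty) {
    assert(tx >= R && tx < TILE + R && ty >= R && ty < TILE + R);
    float acc = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            acc += kernel[(dy + R) * K + (dx + R)] * tile[(ty + dy) * S + (tx + dx)];
    return acc;
}
```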
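```cpp
// Global x-coordinate this thread loads (halo included, may be out of range)
int ox = blockIdx.x * TILE + threadIdx.x - R;
// Clamp to the image border
int x = max(0, min(W - 1, ox));
// Only interior (non-halo) threads produce output
if (threadIdx.x >= R && threadIdx.x < TILE + R) {
    // Output column written by this thread
    int outx = blockIdx.x * TILE + (threadIdx.x - R);
}
```

```
cpp-cuda-image-filter/
├── src/
│   └── image_filter.cu
├── build/
├── .vscode/
│   └── launch.json
├── check_cuda_image_filter_status.sh
├── CMakeLists.txt
├── .gitignore
└── README.md
```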
Run:
```bash
./check_cuda_image_filter_status.sh
```
It will:
- Print CUDA Toolkit + GPU info
- Validate presence of the compiled binary
- Run the blur kernel
- Parse center value and verify ≈ 0.04
- Report timing information
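The parse-and-verify step can be sketched as follows. This is a C++ illustration of the check, not the script itself (which is a shell script whose contents are not shown here). The function name, the tolerance of `1e-3`, and the assumption that the program prints a line of the form `Center value after blur: <number> ...` are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Extract the number following "Center value after blur:" and check that it
// lies within `tol` of the expected impulse response 1/25 = 0.04.
inline bool center_value_ok(const std::string& line, double tol = 1e-3) {
    assert(tol > 0.0);
    double value = 0.0;
    if (std::sscanf(line.c_str(), "Center value after blur: %lf", &value) != 1)
        return false;  // line did not match the expected format
    return std::fabs(value - 1.0 / 25.0) <= tol;
}
```

A line that does not match the expected format is treated as a failure rather than skipped, which keeps a silent change in the program's output from passing the check.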
| Component | Recommended Version |
|---|---|
| CUDA Toolkit | 12.8+ |
| CMake | ≥ 3.24 |
| Compiler | NVCC + GCC |
| GPU | RTX 4070 SUPER (SM 8.9) |
| OS | Windows 11 + WSL2 (Ubuntu 22.04) |
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/conv2d_shared
```
Expected output:
```
Center value after blur: 0.040000 (expected ~0.04)
Success!
```
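The value 0.04 is what a normalized 5×5 box filter should return for a unit impulse: exactly one of the 25 taps sees the 1, so the result is 1/25. A small CPU reference, offered as a hedged sketch rather than the project's own test (the function name, the image layout, and the clamp-to-edge border handling are assumptions), reproduces that number:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int R = 2;  // radius of the 5x5 filter

// Reference 5x5 normalized box blur with clamp-to-edge borders.
inline std::vector<float> box_blur_ref(const std::vector<float>& in, int W, int H) {
    assert(static_cast<int>(in.size()) == W * H);
    std::vector<float> out(in.size(), 0.0f);
    const float w = 1.0f / ((2 * R + 1) * (2 * R + 1));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int dy = -R; dy <= R; ++dy)
                for (int dx = -R; dx <= R; ++dx) {
                    int sx = std::max(0, std::min(W - 1, x + dx));
                    int sy = std::max(0, std::min(H - 1, y + dy));
                    acc += w * in[sy * W + sx];
                }
            out[y * W + x] = acc;
        }
    return out;
}
```

Feeding it an all-zero image with a single 1 at the center yields 1/25 at every pixel within two pixels of the center in both directions, and 0 elsewhere.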
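Run the tests:
```bash
cd build
ctest
```
From the project root, run the status check:
```bash
./check_cuda_image_filter_status.sh
```
VS Code launch configuration (`.vscode/launch.json`):
```json
{
  "name": "Run image_filter",
  "type": "cppdbg",
  "request": "launch",
  "program": "${workspaceFolder}/build/image_filter",
  "cwd": "${workspaceFolder}",
  "MIMode": "gdb"
}
```
- *CUDA by Example*, Sanders & Kandrot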
- NVIDIA CUDA Toolkit Documentation
- CMake CUDA Language Guide
Samuel Huang
Toronto, ON
GitHub: https://github.com/FlosMume