---
title: "Activity analysis for reverse-mode differentiation of (CUDA) GPU kernels"
layout: post
excerpt: "A summary of my GSoC 2025 project focusing on activity analysis for reverse-mode differentiation of (CUDA) GPU kernels."
sitemap: true
author: Maksym Andriichuk
permalink: blogs/gsoc25_andriichuk_final_blog/
banner_image: /images/blog/gsoc-clad-banner.png
date: 2025-11-14
tags: gsoc clad cuda clang c++
---

**Mentors:** Vassil Vassilev, David Lange

## A Brief Introduction

### Main idea

Over a year ago, we added support for differentiating CUDA kernels using Clad. Read more on that [here](https://compiler-research.org/blogs/gsoc24_christina_koutsou_project_final_blog/). As part of that work, we introduced atomic operations in Clad to prevent the race conditions that frequently appear because of how Clad handles statements like ```x=y``` in the reverse mode. Since atomic operations are inefficient, we aim to remove them whenever we can prove that no race condition occurs.

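To see where these atomics come from, recall the reverse-mode rule for an assignment like ```x=y```: the adjoint of ```y``` is incremented by the adjoint of ```x```, and the adjoint of ```x``` is then zeroed. When many GPU threads perform that increment into the same memory location, the accumulation must be atomic. Below is a minimal hand-written sketch of the situation; it is not Clad-generated code, and the "several threads share one adjoint slot" mapping is only there to make the race visible.

```cpp
// Hand-written sketch (not Clad output): every group of four threads
// accumulates into the same adjoint slot, so the increment must be atomic.
__global__ void accumulate_adjoints(double *d_y, const double *d_x, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        // Racy version:   d_y[tid % 4] += d_x[tid];
        // Safe version (atomicAdd on double needs compute capability 6.0+):
        atomicAdd(&d_y[tid % 4], d_x[tid]);
    }
}
```
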
Another part of my GSoC project was to unify the Varied and TBR analyses in how they store information during an analysis run. This would make the implementation of future analyses easier and remove even more adjoints, since the Varied Analysis did not account for variable reassignments.

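As a purely illustrative example (not taken from the Clad test suite) of what accounting for reassignments buys: a flow-sensitive Varied Analysis can notice that a variable stops depending on the input partway through a function.

```cpp
// Illustrative only: after the reassignment, `t` no longer depends on the
// input `x`, so adjoint statements for the following lines can be skipped
// by an analysis that tracks variedness per program point.
double fn(double x) {
    double t = x * x;   // t is varied with respect to x here
    t = 3.0;            // reassignment: t is a constant from here on
    double u = t * 2.0; // u does not depend on x
    return u + x;       // only the `+ x` term contributes to the derivative
}
```
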
## Project Implementation

### 1. Removing atomic operations

Consider the code below:

```cpp
__global__ void kernel_call(double *out, double *in) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    out[index] = in[index];
}

void fn(double *out, double *in) {
    kernel_call<<<1, 16>>>(out, in);
}
```

The adjoint that corresponds to ```out[index] = in[index]``` is:

```cpp
{
    out[index0] = _t2;
    double _r_d0 = _d_out[index0];
    _d_out[index0] = 0.;
    atomicAdd(&_d_in[index], _r_d0);
}
```

Notice that in this case ```index``` is ```injective```, meaning no two threads from any two blocks have the same value of ```index```. This means that when writing to ```_d_in[index]```, no two threads can ever write to the same memory location at the same time, so the ```atomicAdd``` can safely be replaced by a plain addition.

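For contrast, here is a purely illustrative kernel (not from the project) where the index is not injective: two neighbouring threads map to the same slot, so in the reverse pass they would both accumulate into the same adjoint location and the ```atomicAdd``` has to stay.

```cpp
// Illustrative counter-example: threads 2k and 2k+1 compute the same index,
// so the accumulation into _d_in in the reverse pass would race without atomics.
__global__ void kernel_call_noninjective(double *out, double *in) {
    int index = (threadIdx.x + blockIdx.x * blockDim.x) / 2;
    out[index] = in[index];
}
```
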
The implementation involves two static analyzers: one checks whether an index matches a particular form, and the other checks that the index is not modified afterwards. The hardest part is accounting for all possible permutations of the terms in, say, ```threadIdx.x + blockIdx.x * blockDim.x```, as well as for expressions that depend on the index linearly, e.g., ```2*index+1```.

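Conceptually, the first analyzer boils down to pattern-matching the index expression. The sketch below is not Clad's implementation (the real analysis works on Clang's AST); it is a toy C++ model that shows the idea: accept the canonical global-thread-id pattern in either operand order, and accept affine transforms of an already-injective index, treating constants as non-zero for simplicity.

```cpp
#include <memory>

// Toy expression tree standing in for the Clang AST (hypothetical, for illustration).
struct Expr {
    enum Kind { ThreadIdxX, BlockIdxX, BlockDimX, Constant, Add, Mul } kind;
    std::shared_ptr<Expr> lhs, rhs; // only set for Add and Mul nodes
};

std::shared_ptr<Expr> make(Expr::Kind k, std::shared_ptr<Expr> l = nullptr,
                           std::shared_ptr<Expr> r = nullptr) {
    return std::make_shared<Expr>(Expr{k, l, r});
}

// Matches threadIdx.x + blockIdx.x * blockDim.x in any operand order.
bool isGlobalThreadId(const Expr &e) {
    auto isProduct = [](const Expr &p) {
        return p.kind == Expr::Mul &&
               ((p.lhs->kind == Expr::BlockIdxX && p.rhs->kind == Expr::BlockDimX) ||
                (p.lhs->kind == Expr::BlockDimX && p.rhs->kind == Expr::BlockIdxX));
    };
    if (e.kind != Expr::Add)
        return false;
    return (e.lhs->kind == Expr::ThreadIdxX && isProduct(*e.rhs)) ||
           (e.rhs->kind == Expr::ThreadIdxX && isProduct(*e.lhs));
}

// An affine transform c1 * index + c2 of an injective index stays injective
// (assuming non-zero constants, which a real analysis would have to verify).
bool isInjectiveIndex(const Expr &e) {
    if (isGlobalThreadId(e))
        return true;
    if (e.kind == Expr::Add || e.kind == Expr::Mul)
        return (e.lhs->kind == Expr::Constant && isInjectiveIndex(*e.rhs)) ||
               (e.rhs->kind == Expr::Constant && isInjectiveIndex(*e.lhs));
    return false;
}

int main() {
    // index = threadIdx.x + blockIdx.x * blockDim.x
    auto id = make(Expr::Add, make(Expr::ThreadIdxX),
                   make(Expr::Mul, make(Expr::BlockIdxX), make(Expr::BlockDimX)));
    // 2 * index + 1 is still injective.
    auto linear = make(Expr::Add, make(Expr::Mul, make(Expr::Constant), id),
                       make(Expr::Constant));
    return isInjectiveIndex(*linear) ? 0 : 1;
}
```
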
### 2. Varied Analysis

The implementation looked straightforward at first but turned out to be harder than expected. Since the new infrastructure stores more detailed information, the analyses themselves had to be improved. The tricky parts were supporting variable reassignments and handling loops. Support for pointers and OOP was added, and the analysis is now enabled and verified numerically on all gradient tests, which brings it close to being on by default. However, more work is needed to produce even less code.

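Loops are the classic tricky case for this kind of data-flow analysis. In the illustrative function below (again, not from the test suite), ```sum``` only becomes varied with respect to ```x``` after the first iteration, so a single pass over the loop body is not enough; the analysis has to iterate until the set of varied variables stops changing.

```cpp
// Illustrative only: variedness propagates across loop iterations, so the
// analysis must reach a fixed point instead of scanning the body once.
double fn(double x, int n) {
    double sum = 0.0;   // not varied before the loop
    double scale = 1.0; // not varied before the loop
    for (int i = 0; i < n; ++i) {
        sum += scale;   // becomes varied only once scale is varied
        scale *= x;     // scale depends on x from the first iteration on
    }
    return sum;
}
```
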
### 3. Benchmarks

To measure how much difference the analyses make, we used the LULESH benchmark. The improvement in execution time was about 5% across all problem sizes, which is pretty good for an analysis this small.

In trivial cases like the ```kernel_call``` function above, we got up to a 5x speedup, depending on the number of blocks/threads.

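For context on what is being timed in the trivial case: the gradient of a kernel like ```kernel_call``` is requested through Clad's usual entry points. The sketch below assumes the ```clad::gradient``` / ```execute_kernel``` interface described in the earlier CUDA-support blog post linked above; the exact signature may differ in the current API, and the device buffers are assumed to be allocated and initialized elsewhere.

```cpp
#include "clad/Differentiator/Differentiator.h"

// Sketch under the assumptions above: out/in and their adjoints d_out/d_in are
// device pointers prepared by the caller (cudaMalloc + cudaMemcpy/cudaMemset).
void differentiate(double *out, double *in, double *d_out, double *d_in) {
    // Request the reverse-mode derivative of the kernel defined earlier.
    auto grad = clad::gradient(kernel_call);
    // Launch the generated gradient kernel with the same 1x16 configuration
    // used by fn(); the adjoint arguments follow the primal ones.
    grad.execute_kernel(dim3(1), dim3(16), out, in, d_out, d_in);
}
```
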
## Future Work

- Adding more capabilities to the Varied Analysis
- Recognizing more index patterns as injective

## Related Links

- [Clad Repository](https://github.com/vgvassilev/clad)
- [My GitHub Profile](https://github.com/ovdiiuv)