
Commit 90d39e5

Add Max's gsoc final presentation (#343)

* Add Max's gsoc final presentation
* add blogpost

1 parent 533cbe1 commit 90d39e5

File tree

4 files changed: +78 additions, -1 deletions

_data/crconlist2025.yml

Lines changed: 1 addition & 1 deletion

@@ -41,7 +41,7 @@
     that detects possible data-race conditions that would enable reducing atomic
     operations in the Clad-produced code.
-  # slides: /assets/presentations/...
+  slides: /assets/presentations/Maksym_Andriichuk_final_gsoc_atomic.pdf
 
 - title: "Enable automatic differentiation of OpenMP programs with Clad"
   speaker:
_data/standing_meetings.yml

Lines changed: 4 additions & 0 deletions

@@ -3,6 +3,10 @@
   time_cest: "17:00"
   connect: "[Link to zoom](https://princeton.zoom.us/j/97915651167?pwd=MXJ1T2lhc3Z5QWlYbUFnMTZYQlNRdz09)"
   agenda:
+    - title: "Summary: Activity analysis for reverse-mode differentiation of (CUDA) GPU kernels"
+      date: 2025-11-13 15:30:00 +0200
+      speaker: "Maksym Andriichuk"
+      link: "[Slides](/assets/presentations/Maksym_Andriichuk_final_gsoc_atomic.pdf)"
     - title: "Wrap-Up: Implement and improve an efficient, layered tape with prefetching capabilities"
       date: 2025-10-30 15:40:00 +0200
       speaker: "Aditi Milind Joshi"
(new file)

Lines changed: 73 additions & 0 deletions
---
title: "Activity analysis for reverse-mode differentiation of (CUDA) GPU kernels"
layout: post
excerpt: "A summary of my GSoC 2025 project focusing on activity analysis for reverse-mode differentiation of (CUDA) GPU kernels."
sitemap: true
author: Maksym Andriichuk
permalink: blogs/gsoc25_andriichuk_final_blog/
banner_image: /images/blog/gsoc-clad-banner.png
date: 2025-11-14
tags: gsoc clad cuda clang c++
---

**Mentors:** Vassil Vassilev, David Lange

## A Brief Introduction

### Main idea

Over a year ago, we added support for differentiating CUDA kernels with Clad; you can read more about that [here](https://compiler-research.org/blogs/gsoc24_christina_koutsou_project_final_blog/). At the time, we introduced atomic operations in Clad to prevent the race conditions that frequently appear because of how Clad handles statements like ```x = y``` in reverse mode. Since atomic operations are inefficient, we aim to remove them whenever we can be sure that no race condition occurs.

Another part of my GSoC project was to unify the Varied and TBR analyses in how they store information during an analysis run. This makes future analyses easier to implement and removes even more adjoint statements, since the Varied analysis previously did not account for variable reassignments.
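
As a reminder of where the races come from, here is the usual reverse-mode recipe for an assignment, written out schematically (a simplified sketch, not literal Clad output):

```cpp
// Forward pass:   x = y;
// Reverse pass, schematically:
//   _d_y += _d_x;  // propagate x's adjoint to the right-hand side
//   _d_x = 0;      // x was overwritten, so its adjoint is consumed
// When several GPU threads share y, the update `_d_y += _d_x` becomes a
// concurrent read-modify-write, which is why Clad emits an atomicAdd instead.
```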

## Project Implementation

### 1. Removing atomic operations

Consider the code below:

```cpp
__global__ void kernel_call(double *out, double *in) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    out[index] = in[index];
}

void fn(double *out, double *in) {
    kernel_call<<<1, 16>>>(out, in);
}
```
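
As a side note, here is a sketch of how one might request the gradient of this kernel with Clad; the ```execute_kernel``` launcher comes from the CUDA support described in the post linked above, and the exact call signature here should be treated as an assumption:

```cpp
#include "clad/Differentiator/Differentiator.h"

// Sketch under assumptions: out/in and the adjoint buffers d_out/d_in are
// device pointers, and d_in is zero-initialized before the launch.
void launch_gradient(double *out, double *in, double *d_out, double *d_in) {
    auto grad = clad::gradient(kernel_call);
    // Launch the generated reverse-pass kernel with the same configuration
    // as the forward launch in fn().
    grad.execute_kernel(dim3(1), dim3(16), out, in, d_out, d_in);
}
```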

The adjoint that corresponds to ```out[index] = in[index]``` is:

```cpp
{
    // Restore the value out[index0] had before the assignment (saved as _t2).
    out[index0] = _t2;
    // Take the adjoint of the left-hand side and reset it.
    double _r_d0 = _d_out[index0];
    _d_out[index0] = 0.;
    // Propagate the adjoint to the right-hand side atomically.
    atomicAdd(&_d_in[index], _r_d0);
}
```

Notice that in this case ```index``` is injective: no two threads, in any two blocks, compute the same value of ```index```. As a result, when writing to ```_d_in[index]```, no two threads can write to the same memory location at the same time, so the atomic operation is unnecessary.
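
Once injectivity is established, the atomic update can be replaced with a plain one. A sketch of the intended result (hand-written for illustration, not verbatim Clad output):

```cpp
{
    out[index0] = _t2;
    double _r_d0 = _d_out[index0];
    _d_out[index0] = 0.;
    _d_in[index] += _r_d0;  // safe: no other thread writes to _d_in[index]
}
```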

The implementation involves two static analyses: one checks whether an index expression matches one of the recognized forms, and the other checks that the index is not modified afterwards. The hardest part is accounting for all possible permutations of the terms in, say, ```threadIdx.x + blockIdx.x * blockDim.x```, as well as for expressions that depend on the index linearly, e.g., ```2*index+1``` (see the sketch below).
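
To make the recognized forms concrete, a few hypothetical examples of how index expressions might be classified (variable names are illustrative, not Clad internals):

```cpp
int i1 = threadIdx.x + blockIdx.x * blockDim.x;  // canonical form: injective
int i2 = blockDim.x * blockIdx.x + threadIdx.x;  // permuted terms: still injective
int i3 = 2 * i1 + 1;                             // linear in i1: still injective
int i4 = threadIdx.x;                            // repeats across blocks: not injective
```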

### 2. Varied Analysis

The implementation looked straightforward at first but turned out to be harder. Since the new infrastructure is more detailed, the analyses themselves had to be improved; the tricky parts were supporting variable reassignments and handling loops. Support for pointers and OOP was added, and the analysis is now enabled and verified numerically on all gradient tests, which brings it close to being enabled by default. However, more remains to be done to produce even less code.
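
A minimal, hypothetical example of the property the Varied analysis tracks (the function is illustrative, not taken from the test suite):

```cpp
// When differentiating only with respect to x, e.g. clad::gradient(fn, "x"),
// only values that transitively depend on x are "varied" and need adjoints;
// statements involving the rest can be dropped from the reverse pass.
double fn(double x, double y) {
    double c = 3.0;    // not varied: independent of x
    double t = c * y;  // not varied with respect to x
    double z = x * x;  // varied: depends on x
    z = z + t;         // reassignment: z remains varied, t still does not
    return z;
}
```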

### 3. Benchmarks

To measure how much difference the analysis makes, we used the LULESH benchmark. The difference in execution time was about 5% across all problem sizes, which is a good result for an analysis this small.

In trivial cases like the ```kernel_call``` function above, we observed up to a 5x speedup for a given number of blocks/threads.

## Future Work

- Adding more capabilities to the Varied analysis
- Recognizing more index expressions as injective

## Related Links

- [Clad Repository](https://github.com/vgvassilev/clad)
- [My GitHub Profile](https://github.com/ovdiiuv)
Binary file (265 KB) not shown.
