Commit 533cbe1

Genome Sequencing Project final Blog and presentation (#345)
* final blog and presentation
* added spellings
* alphabetical order

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>
1 parent b678900 commit 533cbe1

6 files changed

+142
-5
lines changed

.github/actions/spelling/allow/terms.txt

Lines changed: 8 additions & 1 deletion
@@ -32,7 +32,9 @@ Ohridski
3232
OMP
3333
OpenMP
3434
PTX
35+
QNAME
3536
RAII
37+
RNAME
3638
Resugaring
3739
SBO
3840
Slib
@@ -58,6 +60,7 @@ gitlab
5860
gpu
5961
gridlay
6062
gsoc
63+
hpc
6164
jit
6265
jthread
6366
linkedin
@@ -70,8 +73,11 @@ openmp
7073
pushforward
7174
pythonized
7275
ramview
76+
ramntupleview
7377
reoptimize
78+
rntuple
7479
samtools
80+
samtoramntuple
7581
sbo
7682
sitemap
7783
softsusy
@@ -155,4 +161,5 @@ cartopiax
155161
Oncoprotein
156162
oncoprotein
157163
organoids
158-
paraview
164+
paraview
165+

_data/crconlist2025.yml

Lines changed: 2 additions & 4 deletions
@@ -23,7 +23,6 @@
2323
the Out-of-Process JIT execution which was the major part in implementing
2424
the debugger.
2525
26-
2726
# slides: /assets/presentations/...
2827

2928
- title: "Activity analysis for reverse-mode differentiation of (CUDA) GPU kernels"
@@ -60,7 +59,6 @@
6059
AST parsing and designing corresponding differentiation strategies. Additional
6160
contributions include example applications and comprehensive tests.
6261
63-
6462
# slides: /assets/presentations/...
6563

6664
- title: "Using ROOT in the field of Genome Sequencing"
@@ -83,8 +81,7 @@
8381
FASTQ compression from 14.2GB to 6.8GB. We also developed chromosome based
8482
file-splitting for larger genome file so that chromosome based data can be extracted.
8583
86-
87-
# slides: /assets/presentations/...
84+
slides: /assets/presentations/Aditya_Pandey_GSoC2025_final.pdf
8885
8986
- name: "CompilerResearchCon 2025 (day 1)"
9087
date: 2025-10-30 15:00:00 +0200
@@ -194,3 +191,4 @@
194191
comprehensive unit tests.
195192
196193
slides: /assets/presentations/Abdelrhman_final_presentation_support_usage_of_Thrust_API_in_clad.pdf
194+

_data/standing_meetings.yml

Lines changed: 4 additions & 0 deletions
@@ -441,4 +441,8 @@
441441
date: 2025-10-30 15:20:00 +0200
442442
speaker: "Rohan Timmaraju"
443443
link: "[Slides](/assets/presentations/Rohan_Timmaraju_GSoC25_final.pdf)"
444+
- title: "Final Presentation: Using ROOT in the field of Genome Sequencing"
445+
date: 2025-11-13 16:20:00 +0200
446+
speaker: "Aditya Pandey"
447+
link: "[Slides](/assets/presentations/Aditya_Pandey_GSoC2025_final.pdf)"
444448

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
---
2+
title: "RAMTools: Extending ROOT for Genomic Data Processing"
3+
layout: post
4+
excerpt: "A GSoC 2025 project extending CERN's ROOT framework with the RNTuple format to efficiently process, store, and query large-scale genomic data."
5+
sitemap: true
6+
author: Aditya Pandey
7+
permalink: blogs/gsoc25_aditya_pandey_final_blog/
8+
banner_image: /images/blog/gsoc-banner.png
9+
date: 2025-11-15
10+
tags: gsoc c++ genomics bioinformatics cern root rntuple hpc
11+
---
12+
13+
## Introduction
14+
15+
Hello! I'm Aditya Pandey, and this summer I had the privilege of participating in Google Summer of Code (GSoC) 2025 with CERN-HSF as part of the Compiler Research Group. It has been an incredible experience working with my mentors, Vassil Vassilev and Martin Vassilev, on a project that bridges the gap between high-energy physics (HEP) and genomics.
16+
17+
## Project Overview
18+
19+
RAMTools is a project that extends ROOT, CERN's data processing framework, to efficiently handle genomic sequencing data. ROOT was designed for petabyte-scale physics data, and its columnar storage, parallel I/O, and compression support carry over well to the challenges of modern genomics.
20+
21+
The core problem with traditional genomic formats like SAM/BAM is that they are row-oriented, making analytical queries on massive datasets slow and inefficient. My project introduces **RAM (ROOT Alignment/Map)**, a new system that leverages ROOT's latest columnar format, RNTuple. This provides:
22+
23+
- **Columnar Storage**: Optimal for fast analytical queries and high compression ratios.
24+
- **Parallel I/O**: Built-in support for concurrent read/write operations.
25+
- **Modern Compression**: Support for multiple algorithms (LZ4, LZMA, ZLIB, ZSTD).
26+
27+
By converting SAM data to the RNTuple format, we can achieve significant performance gains in both storage and query speed.
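
To make the row-versus-column distinction concrete, here is a minimal in-memory sketch. It is not RNTuple code and not taken from RAMTools; `AlignmentRow`, `AlignmentColumns`, and `count_in_region` are hypothetical names used only for illustration. The point is that a region count touches just the `rname` and `pos` arrays, so the much larger sequence and quality strings are never read.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Row-oriented view: every record carries all of its fields together,
// so scanning records drags SEQ and QUAL along even when they are unused.
struct AlignmentRow {
    std::string rname;
    std::int32_t pos;
    std::string seq;
    std::string qual;
};

// Column-oriented view (hypothetical, for illustration): one contiguous
// array per field. Index i across all arrays refers to record i.
struct AlignmentColumns {
    std::vector<std::string> rname;
    std::vector<std::int32_t> pos;
    std::vector<std::string> seq;
    std::vector<std::string> qual;

    void push_back(const AlignmentRow& row) {
        rname.push_back(row.rname);
        pos.push_back(row.pos);
        seq.push_back(row.seq);
        qual.push_back(row.qual);
    }
};

// Count alignments on `chrom` whose start position lies in [start, end].
// Only the rname and pos columns are accessed.
std::size_t count_in_region(const AlignmentColumns& cols,
                            const std::string& chrom,
                            std::int32_t start, std::int32_t end) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < cols.pos.size(); ++i) {
        if (cols.rname[i] == chrom && cols.pos[i] >= start && cols.pos[i] <= end) {
            ++n;
        }
    }
    return n;
}
```

On disk, a columnar format applies the same idea per column, which is also why values of one type compress well together.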
28+
29+
## Technical Implementation
30+
31+
The project was implemented in C++17 and built using CMake, relying on the ROOT framework (version 6.26+) for its RNTuple I/O subsystem.
32+
33+
### Architecture Components
34+
35+
1. **SAM Parser**: A custom, high-performance C++17 parser optimized for streaming and processing extremely large SAM files.
36+
37+
2. **RNTuple Writer**: An efficient data model that maps the fields of a SAM record (QNAME, FLAG, RNAME, POS, etc.) to a columnar RNTuple structure.
38+
39+
3. **Chromosome Splitter**: A key feature that partitions the output into separate files by chromosome, which makes it straightforward to run downstream analysis in parallel.
40+
41+
4. **Region Query Engine**: A fast query tool that leverages RNTuple's selective column reading to extract genomic regions (e.g., chr1:10150-10300) without reading the entire file.
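
As an illustration of the parsing step in the list above, the sketch below splits one SAM alignment line into the 11 mandatory tab-separated fields defined by the SAM specification. It is a simplified stand-in, not the RAMTools parser: `SamRecord` and `parse_sam_line` are hypothetical names, optional tags after the 11th field are ignored, and no streaming optimizations are attempted.

```cpp
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// The 11 mandatory SAM fields, in file order.
struct SamRecord {
    std::string qname;       // query (read) name
    std::uint16_t flag = 0;  // bitwise flag
    std::string rname;       // reference sequence name, e.g. "chr1"
    std::int32_t pos = 0;    // 1-based leftmost mapping position
    std::uint8_t mapq = 0;   // mapping quality
    std::string cigar;
    std::string rnext;
    std::int32_t pnext = 0;
    std::int32_t tlen = 0;
    std::string seq;
    std::string qual;
};

// Hypothetical helper: returns std::nullopt for header lines (starting
// with '@'), lines with fewer than 11 fields, or non-numeric numeric fields.
std::optional<SamRecord> parse_sam_line(const std::string& line) {
    if (line.empty() || line[0] == '@') return std::nullopt;

    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string token;
    while (std::getline(in, token, '\t')) fields.push_back(token);
    if (fields.size() < 11) return std::nullopt;

    try {
        SamRecord r;
        r.qname = fields[0];
        r.flag  = static_cast<std::uint16_t>(std::stoul(fields[1]));
        r.rname = fields[2];
        r.pos   = std::stoi(fields[3]);
        r.mapq  = static_cast<std::uint8_t>(std::stoul(fields[4]));
        r.cigar = fields[5];
        r.rnext = fields[6];
        r.pnext = std::stoi(fields[7]);
        r.tlen  = std::stoi(fields[8]);
        r.seq   = fields[9];
        r.qual  = fields[10];
        return r;
    } catch (const std::exception&) {
        // std::stoi / std::stoul throw on malformed or out-of-range numbers.
        return std::nullopt;
    }
}
```

Each member of `SamRecord` corresponds to one candidate column in a columnar layout, which is the mapping the RNTuple Writer item describes.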
42+
43+
### Command-Line Tools
44+
45+
The primary interaction with RAMTools is through two command-line executables:
46+
47+
#### SAM to RAM Conversion (`samtoramntuple`)
48+
49+
Converts a standard SAM file into the optimized RNTuple-based RAM format.
50+
51+
```bash
# Basic conversion
./tools/samtoramntuple input.sam output.root

# Split by chromosome for parallel processing
# (Creates output-chr1.root, output-chr2.root, etc.)
./tools/samtoramntuple input.sam output -split
```
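
The split mode can be pictured as grouping alignments by reference name and routing each group to its own file. The sketch below shows that grouping in memory only. It assumes the `<prefix>-<chrom>.root` naming from the comment in the example above; `split_output_name` and `partition_by_chromosome` are hypothetical helpers, not functions from RAMTools, and no files are written.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Assumed naming scheme, taken from the usage comment: <prefix>-<chrom>.root
std::string split_output_name(const std::string& prefix, const std::string& chrom) {
    return prefix + "-" + chrom + ".root";
}

// Group (RNAME, POS) pairs by the output file they would be written to.
// A real converter would stream full records to per-chromosome writers;
// this sketch only collects positions to show the partitioning.
std::map<std::string, std::vector<std::int64_t>>
partition_by_chromosome(const std::vector<std::pair<std::string, std::int64_t>>& alignments,
                        const std::string& prefix) {
    std::map<std::string, std::vector<std::int64_t>> files;
    for (const auto& [rname, pos] : alignments) {
        files[split_output_name(prefix, rname)].push_back(pos);
    }
    return files;
}
```

Because each output file holds a single chromosome, downstream jobs can process the files independently.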
59+
60+
#### Region Querying (`ramntupleview`)
61+
62+
Queries a specific genomic region from a RAM file, similar to `samtools view`.
63+
64+
```bash
# Usage: ./tools/ramntupleview [input.root] "[chromosome]:[start]-[end]"
./tools/ramntupleview output.root "chr1:10150-10300"
```
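
A region argument of this shape has to be split into a chromosome name and a coordinate range before any columns are read. The following is a minimal sketch of that step, assuming plain decimal coordinates; `Region` and `parse_region` are hypothetical names rather than the tool's actual code, and shorthand such as `50M` is not handled.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

struct Region {
    std::string chrom;
    std::int64_t start = 0;  // 1-based, inclusive
    std::int64_t end = 0;    // inclusive
};

// Parse a whole string_view as a decimal integer; reject trailing characters.
static bool to_int(std::string_view s, std::int64_t& out) {
    if (s.empty()) return false;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), out);
    return ec == std::errc() && ptr == s.data() + s.size();
}

// Hypothetical helper: parse "chrom:start-end". The last ':' is used as the
// separator so that reference names containing ':' still work.
std::optional<Region> parse_region(std::string_view text) {
    const auto colon = text.rfind(':');
    if (colon == std::string_view::npos || colon == 0) return std::nullopt;

    const std::string_view range = text.substr(colon + 1);
    const auto dash = range.find('-');
    if (dash == std::string_view::npos) return std::nullopt;

    Region r;
    r.chrom = std::string(text.substr(0, colon));
    if (!to_int(range.substr(0, dash), r.start)) return std::nullopt;
    if (!to_int(range.substr(dash + 1), r.end)) return std::nullopt;
    if (r.start < 1 || r.end < r.start) return std::nullopt;
    return r;
}
```

With the region parsed, a query only needs the reference-name and position columns to decide which records fall inside it.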
68+
69+
## Performance Achievements
70+
71+
We benchmarked RAMTools using the HG00154 sample from the 1000 Genomes Project, which consists of 196 million reads in a 72.1 GB uncompressed SAM file.
72+
73+
### Query Performance Comparison
74+
75+
RNTuple's columnar architecture shows significant speedups, especially for large region queries, when compared to the older ROOT TTree format and to CRAM (an industry-standard compressed format).
76+
77+
![Region Query Performance](/images/blog/genome_query_time.png)
78+
79+
The benchmarks demonstrate performance across three query sizes:
80+
81+
| Query Region | Size Category | Region Coordinates | RNTuple Time (s) | TTree Time (s) | CRAM Time (s) |
|--------------|--------------|-------------------|------------------|----------------|---------------|
| Small | 50M | chr1:1-50M | 6.69 | 1.29 | 0.34 |
| Medium | 48M | chr21:1-48M | 6.84 | 35.70 | 7.81 |
| Large | 100M | chr2:1-100M | 8.92 | 87.80 | 21.71 |
86+
87+
For the small region (chr1:1-50M), CRAM performs best due to its reference-based compression optimizations for sequential access. However, as query size increases:
88+
89+
- **Medium queries (chr21:1-48M)**: RNTuple is **5.2x faster** than TTree and competitive with CRAM
90+
- **Large queries (chr2:1-100M)**: RNTuple is **9.8x faster** than TTree and **2.4x faster** than CRAM
91+
92+
The performance advantage of RNTuple becomes more pronounced with larger analytical queries, making it ideal for whole-chromosome or multi-gene region analyses common in genomics research.
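
For reference, the speedup factors quoted above are simple ratios of the query times in the table (other format's time divided by RNTuple's time), rounded to one decimal place:

$$
\frac{t_{\text{TTree}}}{t_{\text{RNTuple}}}\bigg|_{\text{medium}} = \frac{35.70}{6.84} \approx 5.2,
\qquad
\frac{t_{\text{TTree}}}{t_{\text{RNTuple}}}\bigg|_{\text{large}} = \frac{87.80}{8.92} \approx 9.8,
\qquad
\frac{t_{\text{CRAM}}}{t_{\text{RNTuple}}}\bigg|_{\text{large}} = \frac{21.71}{8.92} \approx 2.4
$$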
93+
94+
### Storage and Compression
95+
96+
RNTuple also provides excellent compression. The 72.1 GB SAM file was compressed down to 11.4 GB using ZSTD, a 6.3x compression ratio.
97+
98+
| Format | Compression Algo | File Size (GB) | Additional Requirements | Total Storage (GB) |
|--------|-----------------|----------------|------------------------|-------------------|
| SAM | Uncompressed | 72.1 | - | 72.1 |
| CRAM | Reference-based | 7.8 | 3.2 GB reference file | 11.0 |
| RAM-RNTuple | ZSTD | 11.4 | Self-contained | 11.4 |
| RAM-TTree | LZMA | 12.5 | - | 12.5 |
| RAM-TTree | ZLIB | 16.7 | - | 16.7 |
| RAM-TTree | LZ4 | 31.2 | - | 31.2 |
106+
107+
The most significant achievement here is that the 11.4 GB RNTuple file is **completely self-contained**. This is a key advantage over formats like CRAM, which achieves a similar total storage size (11.0 GB) but is dependent on an external 3.2 GB reference genome. This self-contained nature simplifies data archival, distribution, and use in cloud environments immensely.
108+
109+
## Repository & Documentation
110+
111+
- **GitHub**: [RAMTools Repository](https://github.com/compiler-research/ramtools)
112+
113+
## Future Work
114+
115+
While GSoC has concluded, there is a clear path forward for RAMTools:
116+
117+
1. **Broader Format Support**: Support additional genomic file formats to encourage wider adoption.
118+
119+
2. **Further Query Optimization**: Explore multi-threading in the query engine to parallelize data retrieval.
120+
121+
3. **Integration with Analysis Frameworks**: Investigate integration with popular bioinformatics frameworks or visualization tools.
122+
123+
## Conclusion
124+
125+
GSoC 2025 has been a phenomenal experience. I've had the opportunity to dive deep into high-performance C++ and solve real-world problems in genomics.
126+
127+
I am immensely grateful to my mentors, Vassil Vassilev and Martin Vassilev, for their invaluable guidance, insightful code reviews, and constant support. I also want to extend my thanks to the entire ROOT team, CERN-HSF, and Google for making this project possible. I look forward to continuing my contributions to this exciting intersection of science and technology.
128+
125 KB
Binary file not shown.

images/blog/genome_query_time.png

96.8 KB
