Genome Sequencing Project final Blog and presentation (#345)

AdityaPandeyCN · web-flow · commit 533cbe15d151 · 2025-11-15T11:18:42.000+02:00
* final blog and presentation

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

* added spellings

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

* alphabetical order

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

---------

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>
diff --git a/.github/actions/spelling/allow/terms.txt b/.github/actions/spelling/allow/terms.txt
@@ -32,7 +32,9 @@ Ohridski
 OMP
 OpenMP
 PTX
+QNAME
 RAII
+RNAME
 Resugaring
 SBO
 Slib
@@ -58,6 +60,7 @@ gitlab
 gpu
 gridlay
 gsoc
+hpc
 jit
 jthread
 linkedin
@@ -70,8 +73,11 @@ openmp
 pushforward
 pythonized
 ramview
+ramntupleview
 reoptimize
+rntuple
 samtools
+samtoramntuple
 sbo
 sitemap
 softsusy
@@ -155,4 +161,5 @@ cartopiax
 Oncoprotein
 oncoprotein
 organoids
-paraview
+paraview
+
diff --git a/_data/crconlist2025.yml b/_data/crconlist2025.yml
@@ -23,7 +23,6 @@
         the Out-of-Process JIT execution which was the major part in implementing 
         the debugger. 
 
-
       # slides: /assets/presentations/...
 
     - title: "Activity analysis for reverse-mode differentiation of (CUDA) GPU kernels"
@@ -60,7 +59,6 @@
         AST parsing and designing corresponding differentiation strategies. Additional 
         contributions include example applications and comprehensive tests.
 
-        
       # slides: /assets/presentations/...
 
     - title: "Using ROOT in the field of Genome Sequencing"
@@ -83,8 +81,7 @@
         FASTQ compression from 14.2GB to 6.8GB. We also developed chromosome based 
         file-splitting for larger genome file so that chromosome based data can be extracted. 
 
-        
-      # slides: /assets/presentations/...
+      slides: /assets/presentations/Aditya_Pandey_GSoC2025_final.pdf
 
 - name: "CompilerResearchCon 2025 (day 1)"
   date: 2025-10-30 15:00:00 +0200
@@ -194,3 +191,4 @@
         comprehensive unit tests.
         
       slides: /assets/presentations/Abdelrhman_final_presentation_support_usage_of_Thrust_API_in_clad.pdf
+
diff --git a/_data/standing_meetings.yml b/_data/standing_meetings.yml
@@ -441,4 +441,8 @@
       date: 2025-10-30 15:20:00 +0200
       speaker: "Rohan Timmaraju"
       link: "[Slides](/assets/presentations/Rohan_Timmaraju_GSoC25_final.pdf)"
+    - title: "Final Presentation: Using ROOT in the field of Genome Sequencing"
+      date: 2025-11-13 16:20:00 +0200
+      speaker: "Aditya Pandey"
+      link: "[Slides](/assets/presentations/Aditya_Pandey_GSoC2025_final.pdf)"
 
diff --git a/_posts/2025-11-14-using-root-in-the-field-of-genome-sequencing.md b/_posts/2025-11-14-using-root-in-the-field-of-genome-sequencing.md
@@ -0,0 +1,128 @@
+---
+title: "RAMTools: Extending ROOT for Genomic Data Processing"
+layout: post
+excerpt: "A GSoC 2025 project extending CERN's ROOT framework with the RNTuple format to efficiently process, store, and query large-scale genomic data."
+sitemap: true
+author: Aditya Pandey
+permalink: blogs/gsoc25_aditya_pandey_final_blog/
+banner_image: /images/blog/gsoc-banner.png
+date: 2025-11-15
+tags: gsoc c++ genomics bioinformatics cern root rntuple hpc
+---
+
+## Introduction
+
+Hello! I'm Aditya Pandey, and this summer I had the privilege of participating in Google Summer of Code (GSoC) 2025 with CERN-HSF as part of the Compiler Research Group. It has been an incredible experience working with my mentors, Vassil Vassilev and Martin Vassilev, on a project that bridges the gap between high-energy physics (HEP) and genomics.
+
+## Project Overview
+
+RAMTools extends ROOT, CERN's data processing framework, to handle genomic sequencing data efficiently. Although ROOT was designed for petabyte-scale physics data, its columnar storage, parallel I/O, and compression support are well suited to the challenges of modern genomics.
+
+The core problem with traditional genomic formats like SAM/BAM is that they are row-oriented, making analytical queries on massive datasets slow and inefficient. My project introduces **RAM (ROOT Alignment/Map)**, a new system that leverages ROOT's latest columnar format, RNTuple. This provides:
+
+- **Columnar Storage**: Optimal for fast analytical queries and high compression ratios.
+- **Parallel I/O**: Built-in support for concurrent read/write operations.
+- **Modern Compression**: Support for multiple algorithms (LZ4, LZMA, ZLIB, ZSTD).
+
+By converting SAM data to the RNTuple format, we can achieve significant performance gains in both storage and query speed.
+
+## Technical Implementation
+
+The project was implemented in C++17 and built using CMake, relying on the ROOT framework (version 6.26+) for its RNTuple I/O subsystem.
+
+### Architecture Components
+
+1. **SAM Parser**: A custom, high-performance C++17 parser optimized for streaming and processing extremely large SAM files.
+
+2. **RNTuple Writer**: An efficient data model that maps the fields of a SAM record (QNAME, FLAG, RNAME, POS, etc.) to a columnar RNTuple structure.
+
+3. **Chromosome Splitter**: A feature that partitions the output into separate files by chromosome, so that downstream analyses can process chromosomes in parallel.
+
+4. **Region Query Engine**: A fast query tool that leverages RNTuple's selective column reading to extract genomic regions (e.g., chr1:10150-10300) without reading the entire file.
+
+### Command-Line Tools
+
+The primary interaction with RAMTools is through two command-line executables:
+
+#### SAM to RAM Conversion (`samtoramntuple`)
+
+Converts a standard SAM file into the optimized RNTuple-based RAM format.
+
+```bash
+# Basic conversion
+./tools/samtoramntuple input.sam output.root
+
+# Split by chromosome for parallel processing
+# (Creates output-chr1.root, output-chr2.root, etc.)
+./tools/samtoramntuple input.sam output -split
+```
+
+#### Region Querying (`ramntupleview`)
+
+Queries a specific genomic region from a RAM file, similar to `samtools view`.
+
+```bash
+# Usage: ./tools/ramntupleview [input.root] "[chromosome]:[start]-[end]"
+./tools/ramntupleview output.root "chr1:10150-10300"
+```
+
+## Performance Achievements
+
+We benchmarked RAMTools using the HG00154 sample from the 1000 Genomes Project, which consists of 196 million reads in a 72.1 GB uncompressed SAM file.
+
+### Query Performance Comparison
+
+RNTuple's columnar architecture shows significant speedups, especially for large region queries, when compared to the older ROOT TTree format and to CRAM (an industry-standard compressed format).
+
+![Region Query Performance](/images/blog/genome_query_time.png)
+
+The benchmarks demonstrate performance across three query sizes:
+
+| Query Region | Size Category | Region Coordinates | RNTuple Time (s) | TTree Time (s) | CRAM Time (s) |
+|--------------|--------------|-------------------|------------------|----------------|---------------|
+| Small | 50M | chr1:1-50M | 6.69 | 1.29 | 0.34 |
+| Medium | 48M | chr21:1-48M | 6.84 | 35.70 | 7.81 |
+| Large | 100M | chr2:1-100M | 8.92 | 87.80 | 21.71 |
+
+For the small region (chr1:1-50M), CRAM performs best due to its reference-based compression optimizations for sequential access. However, as query size increases:
+
+- **Medium queries (chr21:1-48M)**: RNTuple is **5.2x faster** than TTree and slightly faster than CRAM (6.84 s vs. 7.81 s)
+- **Large queries (chr2:1-100M)**: RNTuple is **9.8x faster** than TTree and **2.4x faster** than CRAM
+
+The performance advantage of RNTuple becomes more pronounced with larger analytical queries, making it ideal for whole-chromosome or multi-gene region analyses common in genomics research.
+
+### Storage and Compression
+
+RNTuple also provides excellent compression. The 72.1 GB SAM file was compressed down to 11.4 GB using ZSTD, a 6.3x compression ratio.
+
+| Format | Compression Algo | File Size (GB) | Additional Requirements | Total Storage (GB) |
+|--------|-----------------|----------------|------------------------|-------------------|
+| SAM | Uncompressed | 72.1 | - | 72.1 |
+| CRAM | Reference-based | 7.8 | 3.2 GB reference file | 11.0 |
+| RAM-RNTuple | ZSTD | 11.4 | Self-contained | 11.4 |
+| RAM-TTree | LZMA | 12.5 | - | 12.5 |
+| RAM-TTree | ZLIB | 16.7 | - | 16.7 |
+| RAM-TTree | LZ4 | 31.2 | - | 31.2 |
+
+The most significant achievement here is that the 11.4 GB RNTuple file is **completely self-contained**. This is a key advantage over CRAM, which achieves a similar total storage size (11.0 GB) but depends on an external 3.2 GB reference genome. Being self-contained greatly simplifies data archival, distribution, and use in cloud environments.
+
+## Repository & Documentation
+
+- **GitHub**: [RAMTools Repository](https://github.com/compiler-research/ramtools)
+
+## Future Work
+
+While GSoC has concluded, there is a clear path forward for RAMTools:
+
+1. **More Format Support**: Support additional genomic file formats to encourage wider adoption.
+
+2. **Further Query Optimization**: Explore multi-threading in the query engine to parallelize data retrieval.
+
+3. **Integration with Analysis Frameworks**: Investigate integration with popular bioinformatics frameworks or visualization tools.
+
+## Conclusion
+
+GSoC 2025 has been a phenomenal experience. I've had the opportunity to dive deep into high-performance C++ and solve real-world problems in genomics.
+
+I am immensely grateful to my mentors, Vassil Vassilev and Martin Vassilev, for their invaluable guidance, insightful code reviews, and constant support. I also want to extend my thanks to the entire ROOT team, CERN-HSF, and Google for making this project possible. I look forward to continuing my contributions to this exciting intersection of science and technology.
+
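To illustrate the "RNTuple Writer" item in the post above, here is a minimal C++17 sketch of how the mandatory SAM fields could be split into per-field columns. It is an illustration under stated assumptions, not the RAMTools implementation: `SamColumns` and `AppendSamLine` are hypothetical names, plain `std::vector`s stand in for the columns of a columnar file, and no ROOT API is used.

```cpp
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical column container: one vector per SAM field, standing in
// for the per-field columns of a columnar file (no ROOT API is used here).
struct SamColumns {
    std::vector<std::string> qname;
    std::vector<std::uint16_t> flag;
    std::vector<std::string> rname;
    std::vector<std::int32_t> pos;  // 1-based leftmost mapping position
    std::vector<std::uint8_t> mapq;
    std::vector<std::string> cigar;
    std::vector<std::string> seq;
    std::vector<std::string> qual;
};

// Split one SAM alignment line into fields and append them column by column.
// Returns false for header lines ('@' prefix) and malformed records, leaving
// the columns untouched so they all keep the same length.
bool AppendSamLine(const std::string& line, SamColumns& cols) {
    if (line.empty() || line[0] == '@') return false;

    std::vector<std::string> f;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) f.push_back(field);
    if (f.size() < 11) return false;  // SAM has 11 mandatory fields

    int flag = 0, pos = 0, mapq = 0;
    try {
        flag = std::stoi(f[1]);
        pos = std::stoi(f[3]);
        mapq = std::stoi(f[4]);
    } catch (const std::exception&) {
        return false;  // non-numeric FLAG, POS, or MAPQ
    }

    cols.qname.push_back(f[0]);
    cols.flag.push_back(static_cast<std::uint16_t>(flag));
    cols.rname.push_back(f[2]);
    cols.pos.push_back(pos);
    cols.mapq.push_back(static_cast<std::uint8_t>(mapq));
    cols.cigar.push_back(f[5]);
    cols.seq.push_back(f[9]);
    cols.qual.push_back(f[10]);
    return true;
}
```

Keeping each field in its own container is what lets a query touch only `rname` and `pos` while leaving the much larger `seq` and `qual` data unread.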
diff --git a/assets/presentations/Aditya_Pandey_GSoC2025_final.pdf b/assets/presentations/Aditya_Pandey_GSoC2025_final.pdf
diff --git a/images/blog/genome_query_time.png b/images/blog/genome_query_time.png
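The region string passed to `ramntupleview` in the post (for example `chr1:10150-10300`) could be handled roughly as sketched below. All names here (`Region`, `ParsePosition`, `ParseRegion`, `AlignmentColumns`, `SelectOverlapping`) are hypothetical. The sketch assumes 1-based inclusive coordinates, as in SAM, and a precomputed reference span per read; it does not reflect how RAMTools actually implements the query.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Hypothetical region type: 1-based, inclusive on both ends (an assumption).
struct Region {
    std::string chrom;
    std::int64_t start;
    std::int64_t end;
};

// Parse a whole string as an integer; rejects empty or partially numeric text.
std::optional<std::int64_t> ParsePosition(const std::string& s) {
    if (s.empty()) return std::nullopt;
    try {
        std::size_t used = 0;
        const long long value = std::stoll(s, &used);
        if (used != s.size()) return std::nullopt;
        return value;
    } catch (const std::exception&) {
        return std::nullopt;
    }
}

// Parse "chrom:start-end". The last ':' is used as the separator so that
// contig names containing ':' still work. Returns nullopt on malformed input.
std::optional<Region> ParseRegion(const std::string& text) {
    const auto colon = text.rfind(':');
    if (colon == std::string::npos || colon == 0) return std::nullopt;
    const auto dash = text.find('-', colon + 1);
    if (dash == std::string::npos) return std::nullopt;

    const auto start = ParsePosition(text.substr(colon + 1, dash - colon - 1));
    const auto end = ParsePosition(text.substr(dash + 1));
    if (!start || !end || *start < 1 || *end < *start) return std::nullopt;
    return Region{text.substr(0, colon), *start, *end};
}

// Only the columns a region query needs; sequence and quality are not loaded.
struct AlignmentColumns {
    std::vector<std::string> rname;
    std::vector<std::int64_t> pos;      // 1-based leftmost position
    std::vector<std::int64_t> refSpan;  // reference bases covered by the read
};

// Return the indices of reads that overlap the region.
std::vector<std::size_t> SelectOverlapping(const AlignmentColumns& cols,
                                           const Region& region) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < cols.pos.size(); ++i) {
        const std::int64_t last = cols.pos[i] + cols.refSpan[i] - 1;
        if (cols.rname[i] == region.chrom && cols.pos[i] <= region.end &&
            last >= region.start) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

The filter is a linear scan for clarity; a real query engine would likely skip whole blocks of entries using per-chromosome files or position ordering.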