| 1 | +--- |
| 2 | +title: "RAMTools: Extending ROOT for Genomic Data Processing" |
| 3 | +layout: post |
| 4 | +excerpt: "A GSoC 2025 project extending CERN's ROOT framework with the RNTuple format to efficiently process, store, and query large-scale genomic data." |
| 5 | +sitemap: true |
| 6 | +author: Aditya Pandey |
| 7 | +permalink: blogs/gsoc25_aditya_pandey_final_blog/ |
| 8 | +banner_image: /images/blog/gsoc-banner.png |
| 9 | +date: 2025-11-15 |
| 10 | +tags: gsoc c++ genomics bioinformatics cern root rntuple hpc |
| 11 | +--- |
| 12 | + |
| 13 | +## Introduction |
| 14 | + |
| 15 | +Hello! I'm Aditya Pandey, and this summer I had the privilege of participating in Google Summer of Code (GSoC) 2025 with CERN-HSF as part of the Compiler Research Group. It has been an incredible experience working with my mentors, Vassil Vassilev and Martin Vassilev, on a project that bridges the gap between high-energy physics (HEP) and genomics. |
| 16 | + |
| 17 | +## Project Overview |
| 18 | + |
| 19 | +RAMTools is a project that extends ROOT, CERN's data processing framework, to efficiently handle genomic sequencing data. Although ROOT was designed for petabyte-scale physics data, its columnar storage and parallel I/O are well suited to the challenges of modern genomics. |
| 20 | + |
| 21 | +The core problem with traditional genomic formats like SAM/BAM is that they are row-oriented, making analytical queries on massive datasets slow and inefficient. My project introduces **RAM (ROOT Alignment/Map)**, a new system that leverages ROOT's latest columnar format, RNTuple. This provides: |
| 22 | + |
| 23 | +- **Columnar Storage**: Optimal for fast analytical queries and high compression ratios. |
| 24 | +- **Parallel I/O**: Built-in support for concurrent read/write operations. |
| 25 | +- **Modern Compression**: Support for multiple algorithms (LZ4, LZMA, ZLIB, ZSTD). |
| 26 | + |
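| 27 | +By converting SAM data to the RNTuple format, we gain in both storage and query speed: in the benchmarks below, the RNTuple file is 6.3x smaller than the uncompressed SAM input, and large region queries run up to 2.4x faster than with CRAM. |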
| 28 | + |
| 29 | +## Technical Implementation |
| 30 | + |
| 31 | +The project was implemented in C++17 and built using CMake, relying on the ROOT framework (version 6.26+) for its RNTuple I/O subsystem. |
| 32 | + |
| 33 | +### Architecture Components |
| 34 | + |
| 35 | +1. **SAM Parser**: A custom, high-performance C++17 parser optimized for streaming and processing extremely large SAM files. |
| 36 | + |
| 37 | +2. **RNTuple Writer**: An efficient data model that maps the fields of a SAM record (QNAME, FLAG, RNAME, POS, etc.) to a columnar RNTuple structure. |
| 38 | + |
| 39 | +3. **Chromosome Splitter**: Partitions the output into one file per chromosome, so downstream analyses can process chromosomes in parallel. |
| 40 | + |
| 41 | +4. **Region Query Engine**: A fast query tool that leverages RNTuple's selective column reading to extract genomic regions (e.g., chr1:10150-10300) without reading the entire file. |
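To make the first two components more concrete, here is a minimal sketch of parsing one SAM alignment line and appending it to per-field columns. It assumes only the 11 mandatory tab-separated SAM fields and ignores optional tags. The names `SamRecord`, `SamColumns`, and `parseSamLine` are hypothetical and are not taken from the RAMTools code base; plain `std::vector` columns stand in for RNTuple fields so the example runs without ROOT.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// The 11 mandatory SAM alignment fields, in specification order.
struct SamRecord {
    std::string qname;
    std::uint16_t flag = 0;
    std::string rname;
    std::int32_t pos = 0;  // 1-based leftmost mapping position
    std::uint8_t mapq = 0;
    std::string cigar;
    std::string rnext;
    std::int32_t pnext = 0;
    std::int32_t tlen = 0;
    std::string seq;
    std::string qual;
};

// Parses one alignment line. Returns nullopt for header lines (starting
// with '@'), lines with fewer than 11 fields, or non-numeric numeric fields.
std::optional<SamRecord> parseSamLine(const std::string& line) {
    if (line.empty() || line[0] == '@') return std::nullopt;

    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string token;
    while (std::getline(stream, token, '\t')) fields.push_back(token);
    if (fields.size() < 11) return std::nullopt;

    try {
        SamRecord rec;
        rec.qname = fields[0];
        rec.flag = static_cast<std::uint16_t>(std::stoul(fields[1]));
        rec.rname = fields[2];
        rec.pos = std::stoi(fields[3]);
        rec.mapq = static_cast<std::uint8_t>(std::stoul(fields[4]));
        rec.cigar = fields[5];
        rec.rnext = fields[6];
        rec.pnext = std::stoi(fields[7]);
        rec.tlen = std::stoi(fields[8]);
        rec.seq = fields[9];
        rec.qual = fields[10];
        return rec;
    } catch (const std::exception&) {
        return std::nullopt;  // FLAG, POS, MAPQ, PNEXT, or TLEN was not a number
    }
}

// Illustrative columnar layout: one contiguous vector per field, so a query
// that needs only RNAME and POS never touches SEQ or QUAL. Only four of the
// eleven fields are kept here to keep the sketch short.
struct SamColumns {
    std::vector<std::string> qname;
    std::vector<std::uint16_t> flag;
    std::vector<std::string> rname;
    std::vector<std::int32_t> pos;

    void append(const SamRecord& rec) {
        qname.push_back(rec.qname);
        flag.push_back(rec.flag);
        rname.push_back(rec.rname);
        pos.push_back(rec.pos);
    }

    std::size_t size() const { return pos.size(); }
};
```

A converter built along these lines would call `parseSamLine` for each input line and `append` each successfully parsed record; the real writer would fill RNTuple fields instead of vectors.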
| 42 | + |
| 43 | +### Command-Line Tools |
| 44 | + |
| 45 | +RAMTools is used primarily through two command-line executables: |
| 46 | + |
| 47 | +#### SAM to RAM Conversion (`samtoramntuple`) |
| 48 | + |
| 49 | +Converts a standard SAM file into the optimized RNTuple-based RAM format. |
| 50 | + |
| 51 | +```bash |
| 52 | +# Basic conversion |
| 53 | +./tools/samtoramntuple input.sam output.root |
| 54 | + |
| 55 | +# Split by chromosome for parallel processing |
| 56 | +# (Creates output-chr1.root, output-chr2.root, etc.) |
| 57 | +./tools/samtoramntuple input.sam output -split |
| 58 | +``` |
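The splitting step can be pictured roughly as follows. This is a sketch, not the tool's implementation: it assumes records are grouped by their RNAME value and that file names follow the `output-chr1.root` pattern from the comment above. Both helper names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: builds a per-chromosome file name following the
// "<prefix>-<chromosome>.root" pattern.
std::string splitFileName(const std::string& prefix, const std::string& rname) {
    return prefix + "-" + rname + ".root";
}

// Hypothetical helper: groups record indices by reference name so that
// each group can be written to its own file.
std::map<std::string, std::vector<std::size_t>>
partitionByChromosome(const std::vector<std::string>& rnames) {
    std::map<std::string, std::vector<std::size_t>> parts;
    for (std::size_t i = 0; i < rnames.size(); ++i) {
        parts[rnames[i]].push_back(i);
    }
    return parts;
}
```

Because each partition ends up in its own file, separate worker processes can each open one chromosome without coordinating with the others.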
| 59 | + |
| 60 | +#### Region Querying (`ramntupleview`) |
| 61 | + |
| 62 | +Queries a specific genomic region from a RAM file, similar to `samtools view`. |
| 63 | + |
| 64 | +```bash |
| 65 | +# Usage: ./tools/ramntupleview [input.root] "[chromosome]:[start]-[end]" |
| 66 | +./tools/ramntupleview output.root "chr1:10150-10300" |
| 67 | +``` |
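A region string such as `chr1:10150-10300` has to be split into a chromosome name and a coordinate range before any filtering can happen. The sketch below shows one way to do that in C++17. `Region`, `parseRegion`, and `startsInRegion` are hypothetical names, and the filter is a simplification that checks only a read's start position; a complete overlap test would also need the alignment end derived from the CIGAR string.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// A parsed "chrom:start-end" query with 1-based, inclusive coordinates.
struct Region {
    std::string chrom;
    std::int64_t start = 0;
    std::int64_t end = 0;
};

// Parses "chr1:10150-10300". Returns nullopt if the text is malformed,
// the start is below 1, or the end precedes the start.
std::optional<Region> parseRegion(const std::string& text) {
    const auto colon = text.rfind(':');
    if (colon == std::string::npos || colon == 0) return std::nullopt;
    const auto dash = text.find('-', colon + 1);
    if (dash == std::string::npos) return std::nullopt;

    try {
        Region region;
        region.chrom = text.substr(0, colon);
        region.start = std::stoll(text.substr(colon + 1, dash - colon - 1));
        region.end = std::stoll(text.substr(dash + 1));
        if (region.start < 1 || region.end < region.start) return std::nullopt;
        return region;
    } catch (const std::exception&) {
        return std::nullopt;  // a coordinate was missing or not a number
    }
}

// Simplified filter: true when a read on `rname` starts inside the region.
bool startsInRegion(const Region& region, const std::string& rname,
                    std::int64_t pos) {
    return rname == region.chrom && pos >= region.start && pos <= region.end;
}
```

With a columnar layout, a filter like this needs only the RNAME and POS columns, which is why a region query can skip the much larger sequence and quality columns until a read is known to match.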
| 68 | + |
| 69 | +## Performance Achievements |
| 70 | + |
| 71 | +We benchmarked RAMTools using the HG00154 sample from the 1000 Genomes Project, which consists of 196 million reads in a 72.1 GB uncompressed SAM file. |
| 72 | + |
| 73 | +### Query Performance Comparison |
| 74 | + |
| 75 | +RNTuple's columnar architecture shows significant speedups on medium and large region queries when compared to the older ROOT TTree format and to CRAM, an industry-standard compressed format. |
| 76 | + |
| 77 | + |
| 78 | + |
| 79 | +The benchmarks demonstrate performance across three query sizes: |
| 80 | + |
| 81 | +| Query Size | Region Length (bp) | Region Coordinates | RNTuple Time (s) | TTree Time (s) | CRAM Time (s) | |
| 82 | +|--------------|--------------|-------------------|------------------|----------------|---------------| |
| 83 | +| Small | 50M | chr1:1-50M | 6.69 | 1.29 | 0.34 | |
| 84 | +| Medium | 48M | chr21:1-48M | 6.84 | 35.70 | 7.81 | |
| 85 | +| Large | 100M | chr2:1-100M | 8.92 | 87.80 | 21.71 | |
| 86 | + |
| 87 | +For the small region (chr1:1-50M), CRAM performs best (0.34 s), likely because of its reference-based compression optimizations for sequential access, and TTree (1.29 s) also outperforms RNTuple (6.69 s). However, as query size increases: |
| 88 | + |
| 89 | +- **Medium queries (chr21:1-48M)**: RNTuple is **5.2x faster** than TTree and about 1.1x faster than CRAM |
| 90 | +- **Large queries (chr2:1-100M)**: RNTuple is **9.8x faster** than TTree and **2.4x faster** than CRAM |
| 91 | + |
| 92 | +The performance advantage of RNTuple becomes more pronounced with larger analytical queries, making it ideal for whole-chromosome or multi-gene region analyses common in genomics research. |
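The speedup figures quoted above are simple ratios of the timings in the table. The short sketch below reproduces that arithmetic; the `speedup` helper and the `kTimings` array are illustrative names, and the numbers are copied directly from the benchmark table.

```cpp
// Speedup of a candidate over a baseline: values above 1 mean the
// candidate is faster, values below 1 mean it is slower.
double speedup(double baselineSeconds, double candidateSeconds) {
    return baselineSeconds / candidateSeconds;
}

struct QueryTiming {
    const char* region;
    double rntuple;  // seconds
    double ttree;    // seconds
    double cram;     // seconds
};

// Timings copied from the benchmark table above.
constexpr QueryTiming kTimings[] = {
    {"chr1:1-50M", 6.69, 1.29, 0.34},
    {"chr21:1-48M", 6.84, 35.70, 7.81},
    {"chr2:1-100M", 8.92, 87.80, 21.71},
};
```

For example, `speedup(kTimings[2].ttree, kTimings[2].rntuple)` divides 87.80 by 8.92, which gives the roughly 9.8x figure for the large query, while the same call on the small query with CRAM as the baseline yields a value below 1, reflecting that CRAM is faster there.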
| 93 | + |
| 94 | +### Storage and Compression |
| 95 | + |
| 96 | +RNTuple also provides excellent compression. The 72.1 GB SAM file was compressed down to 11.4 GB using ZSTD, a 6.3x compression ratio. |
| 97 | + |
| 98 | +| Format | Compression Algorithm | File Size (GB) | Additional Requirements | Total Storage (GB) | |
| 99 | +|--------|-----------------|----------------|------------------------|-------------------| |
| 100 | +| SAM | Uncompressed | 72.1 | - | 72.1 | |
| 101 | +| CRAM | Reference-based | 7.8 | 3.2 GB reference file | 11.0 | |
| 102 | +| RAM-RNTuple | ZSTD | 11.4 | Self-contained | 11.4 | |
| 103 | +| RAM-TTree | LZMA | 12.5 | - | 12.5 | |
| 104 | +| RAM-TTree | ZLIB | 16.7 | - | 16.7 | |
| 105 | +| RAM-TTree | LZ4 | 31.2 | - | 31.2 | |
| 106 | + |
| 107 | +The most significant result here is that the 11.4 GB RNTuple file is **completely self-contained**. This is a key advantage over CRAM, which reaches a similar total storage size (11.0 GB) but depends on an external 3.2 GB reference genome. Being self-contained greatly simplifies data archival, distribution, and use in cloud environments. |
| 108 | + |
| 109 | +## Repository & Documentation |
| 110 | + |
| 111 | +- **GitHub**: [RAMTools Repository](https://github.com/compiler-research/ramtools) |
| 112 | + |
| 113 | +## Future Work |
| 114 | + |
| 115 | +While GSoC has concluded, there is a clear path forward for RAMTools: |
| 116 | + |
| 117 | +1. **Broader Format Support**: Support additional input and output formats to encourage wider adoption. |
| 118 | + |
| 119 | +2. **Further Query Optimization**: Explore multi-threading in the query engine to parallelize data retrieval. |
| 120 | + |
| 121 | +3. **Integration with Analysis Frameworks**: Investigate integration with popular bioinformatics frameworks or visualization tools. |
| 122 | + |
| 123 | +## Conclusion |
| 124 | + |
| 125 | +GSoC 2025 has been a phenomenal experience. I've had the opportunity to dive deep into high-performance C++ and solve real-world problems in genomics. |
| 126 | + |
| 127 | +I am immensely grateful to my mentors, Vassil Vassilev and Martin Vassilev, for their invaluable guidance, insightful code reviews, and constant support. I also want to extend my thanks to the entire ROOT team, CERN-HSF, and Google for making this project possible. I look forward to continuing my contributions to this exciting intersection of science and technology. |
| 128 | + |