This is a production-ready system built in Rust that combines high-performance asynchronous processing, AI/ML integration with the Gemini API, multi-format document handling, and security practices.
Our API implements a multi-layered architecture to tackle the problem statement and all test cases.
```
+===================================================+
| main.rs (Interactive CLI)                         |
+---------------------------------------------------+
| server.rs (API Gateway)                           |
+---------------------------------------------------+
| final_challenge.rs (Contest Logic)                |
+---------------------------------------------------+
| ai/embed.rs (Vector Database Layer)               |
| ai/gemini.rs (LLM Intelligence Layer)             |
+---------------------------------------------------+
| pdf.rs + ocr.rs (Processing Pipeline)             |
+---------------------------------------------------+
| MySQL (Persistent Vector Store)                   |
+===================================================+
```
## HackerXAPI Architecture:
```
├── main.rs (Interactive CLI)
├── server.rs (API Gateway)
├── final_challenge.rs (Contest Logic)
├── AI Layer:
│   ├── embed.rs (Vector Database Layer)
│   └── gemini.rs (LLM Intelligence Layer)
├── Processing Layer:
│   ├── pdf.rs (Document Processing)
│   └── ocr.rs (OCR Pipeline)
└── MySQL (Persistent Vector Store)
```
- Intelligent Document Processing: Handles a wide array of file types (PDF, DOCX, XLSX, PPTX, JPEG, PNG, TXT) with a robust fallback chain.
- High-Performance AI: Leverages the Gemini API with optimized chunking, parallel processing, and smart context filtering for fast, relevant responses.
- Enterprise-Grade Security: Features multi-layer security, including extensive prompt injection sanitization and parameterized SQL queries.
- Scalable Architecture: Built with a stateless design, `tokio` for async operations, and CPU-aware parallelization for horizontal scaling.
- Interactive Management: Includes a menu-driven CLI for easy server management, status monitoring, and graceful shutdowns.
The system is designed as a series of specialized layers, from the user-facing API and CLI down to the persistent database storage.
```mermaid
flowchart TD
%% Entry Point
A[main.rs CLI Menu] -->|Start Server| B[Axum Server :8000]
A -->|Show Status| A2[Status Placeholder]
A -->|Exit| A3[Program Exit]
%% Server Request Handler
B -->|POST /api/v1/hackrx/run| C[server::hackrx_run]
C --> C1{Bearer Token Valid?}
C1 -->|No| E401([401 Unauthorized])
C1 -->|Yes| C2[generate_filename_from_url]
%% Document Processing Pipeline
C2 --> D1{File exists locally?}
D1 -->|No| D2[download_file with extension validation]
D1 -->|Yes| D3[Skip download]
D2 --> D4[extract_file_text]
D3 --> D4
%% Multi-Format Text Extraction
subgraph Extraction [Text Extraction Layer]
D4 --> EXT1{File Extension?}
EXT1 -->|PDF| EXT_PDF[Parallel PDF processing with pdftk/qpdf]
EXT1 -->|DOCX| EXT_DOCX[ZIP extraction to XML parsing]
EXT1 -->|XLSX| EXT_XLSX[Calamine spreadsheet to text]
EXT1 -->|PPTX| EXT_PPTX[ImageMagick or LibreOffice to OCR]
EXT1 -->|PNG/JPEG| EXT_IMG[Direct OCR with ocrs CLI]
EXT1 -->|TXT| EXT_TXT[Token regex extraction]
EXT_PDF --> TXT_OUT[Save to pdfs/filename.txt]
EXT_DOCX --> TXT_OUT
EXT_XLSX --> TXT_OUT
EXT_PPTX --> TXT_OUT
EXT_IMG --> TXT_OUT
EXT_TXT --> TXT_OUT
end
%% Embeddings and Vector Storage
TXT_OUT --> EMB_START[get_policy_chunk_embeddings]
subgraph Embeddings [Vector Embeddings System]
EMB_START --> EMB1{Embeddings exist in MySQL?}
EMB1 -->|Yes| EMB_LOAD[Load from pdf_embeddings table]
EMB1 -->|No| EMB_CHUNK[Chunk text into 33k char pieces]
EMB_CHUNK --> EMB_API[Parallel Gemini Embedding API calls]
EMB_API --> EMB_STORE[Batch store to MySQL]
EMB_LOAD --> EMB_RETURN[Return chunk embeddings]
EMB_STORE --> EMB_RETURN
end
%% Context-Aware Retrieval
EMB_RETURN --> CTX_START[rewrite_policy_with_context]
subgraph Context_RAG [Context Selection RAG]
CTX_START --> CTX1[Embed combined questions]
CTX1 --> CTX2[Cosine similarity calculation]
CTX2 --> CTX3[Select top 10 relevant chunks]
CTX3 --> CTX4[Write contextfiltered.txt]
end
%% Answer Generation
CTX4 --> ANS_START[answer_questions]
subgraph Answer_Gen [Answer Generation]
ANS_START --> ANS1[Load filtered context]
ANS1 --> ANS2[Sanitize against prompt injection]
ANS2 --> ANS3[Gemini 2.0 Flash API call]
ANS3 --> ANS4[Parse structured JSON response]
ANS4 --> ANS_END[Extract answers array]
end
%% Final Response
ANS_END --> SUCCESS([200 OK JSON Response])
%% Error Handling
C --> ERR_HANDLER[Error Handler]
ERR_HANDLER --> ERR_RESPONSE([4xx/5xx Error Response])
%% External Dependencies
subgraph External [External Tools & Services]
EXT_TOOLS[pdftk, qpdf, ImageMagick, LibreOffice, ocrs, pdftoppm]
MYSQL_DB[(MySQL Database)]
GEMINI_API[Google Gemini API]
end
Extraction -.-> EXT_TOOLS
Embeddings -.-> MYSQL_DB
Embeddings -.-> GEMINI_API
Answer_Gen -.-> GEMINI_API
```
This layer handles all interactions with the AI model and vector embeddings, featuring performance optimizations and smart context filtering.
- Chunking Strategy: Text is split into 33,000-character chunks, which is optimal for the Gemini API.
- Parallel Processing: Handles up to 50 concurrent requests using `futures::stream` for high throughput (see the sketch below).
- Database Caching: Caches embedding vectors in MySQL to avoid redundant and costly API calls.
- Batch Operations: Uses functions like `batch_store_pdf_embeddings` for efficient bulk database insertions.
- Top-K Retrieval: Fetches the top 10 most relevant document chunks for any given query.
- Similarity Threshold: Enforces a minimum relevance score of 0.5 (cosine similarity) to ensure context quality.
- Combined Query Embedding: Creates a single, more effective embedding when multiple user questions are asked at once.
- Prompt Injection Defense: Proactively sanitizes all user input against a list of over 22 known prompt injection patterns to protect the LLM.
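The chunk-and-embed flow described above can be sketched roughly as follows; `embed_chunk` is a hypothetical stand-in for the real Gemini embedding call in `embed.rs`, and the MySQL cache lookup is omitted:

```rust
use futures::stream::{self, StreamExt};

const CHUNK_SIZE: usize = 33_000; // characters per chunk
const PARALLEL_REQS: usize = 50;  // concurrent embedding requests

// Stand-in for a single Gemini embedding call; the real request logic lives in embed.rs.
async fn embed_chunk(_chunk: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    unimplemented!("call the Gemini embedding API here")
}

async fn embed_document(text: &str) -> Result<Vec<(String, Vec<f32>)>, Box<dyn std::error::Error>> {
    // Split on char boundaries so multi-byte UTF-8 text is never cut mid-codepoint.
    let chars: Vec<char> = text.chars().collect();
    let chunks: Vec<String> = chars.chunks(CHUNK_SIZE).map(|c| c.iter().collect()).collect();

    // Run up to PARALLEL_REQS embedding requests concurrently and keep each chunk's
    // text alongside its vector so both can be cached in MySQL afterwards.
    stream::iter(chunks)
        .map(|chunk| async move {
            let vector = embed_chunk(&chunk).await?;
            Ok((chunk, vector))
        })
        .buffer_unordered(PARALLEL_REQS)
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect()
}
```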
```rust
// Cosine similarity with explicit zero-magnitude handling
fn cosine_similarity(vec1: &[f32], vec2: &[f32]) -> f32 {
    let dot_product: f32 = vec1.iter().zip(vec2.iter()).map(|(a, b)| a * b).sum();
    let magnitude1: f32 = vec1.iter().map(|v| v * v).sum::<f32>().sqrt();
    let magnitude2: f32 = vec2.iter().map(|v| v * v).sum::<f32>().sqrt();
    // Avoid dividing by zero for empty or all-zero vectors
    if magnitude1 == 0.0 || magnitude2 == 0.0 {
        return 0.0;
    }
    dot_product / (magnitude1 * magnitude2)
}
```
33,000character chunks, which is optimal for the Gemini API. - Parallel Processing: Handles up to
50concurrent requests usingfutures::stream. - Database Caching: Caches embedding vectors in MySQL using the native
JSONdata type. - Batch Operations: Uses functions like
batch_store_pdf_embeddingsfor high-performance bulk database insertions.
- Top-K Retrieval: Fetches the
10most relevant document chunks for any given query. - Similarity Threshold: Enforces a minimum relevance score of
0.5(cosine similarity) to ensure context quality. - Combined Query Embedding: Creates a single, more effective embedding when multiple user questions are asked at once.
This component showcases enterprise-level security and reliability in its integration with the Gemini model.
```rust
fn sanitize_policy(content: &str) -> String {
    let dangerous_patterns = [
        r"(?i)ignore\s+previous\s+instructions",
        r"(?i)disregard\s+the\s+above",
        r"(?i)pretend\s+to\s+be",
        // ... 22 different injection patterns
    ];
    // Regex-based sanitization (one plausible completion): remove any matching text
    let mut sanitized = content.to_string();
    for pattern in dangerous_patterns {
        if let Ok(re) = regex::Regex::new(pattern) {
            sanitized = re.replace_all(&sanitized, "").into_owned();
        }
    }
    sanitized
}
```
- Structured Output: Enforces a JSON schema for consistent and predictable LLM responses.
- Cache Busting: Uses UUIDs to ensure request uniqueness where needed.
- Response Validation: Implements multi-layer JSON parsing.
- Prompt Engineering: Constructs context-aware prompts for more accurate results.
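To illustrate the structured-output and cache-busting points, the sketch below builds a `generateContent` request body that asks Gemini for JSON matching an answers-array schema and embeds a UUID in the prompt. The field names follow the public Gemini REST API; the prompt layout and schema are illustrative, not the exact ones used by `gemini.rs`:

```rust
use serde_json::json;
use uuid::Uuid;

// Build a generateContent request body that asks Gemini for JSON output matching
// an { "answers": [...] } schema. Prompt layout and schema are illustrative only.
fn build_request_body(context: &str, questions: &[String]) -> serde_json::Value {
    let prompt = format!(
        "Answer the questions using only the policy context.\n\
         Request-ID: {}\n\nContext:\n{}\n\nQuestions:\n{}",
        Uuid::new_v4(), // cache-busting UUID keeps otherwise identical payloads unique
        context,
        questions.join("\n"),
    );

    json!({
        "contents": [{ "parts": [{ "text": prompt }] }],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "OBJECT",
                "properties": {
                    "answers": { "type": "ARRAY", "items": { "type": "STRING" } }
                },
                "required": ["answers"]
            }
        }
    })
}
```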
The system supports the following file types for text extraction, dispatched by extension:
```rust
match ext.as_str() {
    "docx" => convert_docx_to_pdf(file_path)?,
    "xlsx" => convert_xlsx_to_pdf(file_path)?,
    "pdf" => extract_pdf_text_sync(file_path),
    "jpeg" | "png" => crate::ocr::extract_text_with_ocrs(file_path),
    "pptx" => extract_text_from_pptx(file_path),
    "txt" => extract_token_from_text(file_path),
}
```
- CPU-Aware Parallelization: Uses `num_cpus::get()` to spawn an optimal number of threads for processing.
- Memory-Safe Concurrency: Leverages `Arc<String>` for safe, shared ownership of data across parallel tasks.
- Chunk-based PDF Processing: Intelligently splits large PDFs into chunks to be processed in parallel across CPU cores.
- Tool Fallback Chain: Implements a resilient processing strategy, trying `pdftk`, then `qpdf`, and finally falling back to estimation if needed (see the sketch below).
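The fallback chain can be pictured roughly as below: shell out to `pdftk` for the page count, fall back to `qpdf`, and finally estimate from file size. The command-line flags are the standard ones for those tools; the estimation heuristic is a placeholder, not the project's actual formula:

```rust
use std::process::Command;

// Page count with fallbacks: try pdftk, then qpdf, then estimate from file size.
fn page_count(path: &str) -> usize {
    // pdftk prints a "NumberOfPages: N" line in its dump_data report.
    if let Ok(out) = Command::new("pdftk").args([path, "dump_data"]).output() {
        if out.status.success() {
            let report = String::from_utf8_lossy(&out.stdout);
            if let Some(line) = report.lines().find(|l| l.starts_with("NumberOfPages:")) {
                if let Ok(n) = line["NumberOfPages:".len()..].trim().parse() {
                    return n;
                }
            }
        }
    }
    // qpdf --show-npages prints just the page count.
    if let Ok(out) = Command::new("qpdf").args(["--show-npages", path]).output() {
        if out.status.success() {
            if let Ok(n) = String::from_utf8_lossy(&out.stdout).trim().parse() {
                return n;
            }
        }
    }
    // Placeholder estimate: assume roughly 50 KB per page, at least one page.
    std::fs::metadata(path)
        .map(|m| (m.len() / 50_000).max(1) as usize)
        .unwrap_or(1)
}
```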
```rust
let page_ranges: Vec<(usize, usize)> = (0..num_cores)
    .map(|i| {
        let start = i * pages_per_chunk + 1;
        let end = ((i + 1) * pages_per_chunk).min(total_pages);
        (start, end)
    })
    .collect();
```
The system also uses OCR to parse text from images and PPTX files.
Multi-Tool Pipeline:
- Primary: `ImageMagick` direct conversion.
- Fallback: A `LibreOffice` → PDF → Images chain.
- OCR Engine: Uses `ocrs-cli` for the final text extraction.
- Format Chain: A dedicated PPTX → Images → OCR → Text chain.
Quality Optimization:
- DPI Settings: Balances quality vs. speed with a `150 DPI` setting.
- Background Processing: Enforces a white background and alpha removal for better accuracy.
- Slide Preservation: Maintains original slide order and numbering throughout the process.
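A rough sketch of one render-and-OCR step under these settings, assuming ImageMagick's `convert` and the `ocrs` binary are on `PATH` and that `ocrs <image>` prints recognized text to stdout:

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

// Render one page at 150 DPI on a flattened white background, then OCR it.
fn ocr_page(pdf_path: &str, page: usize) -> std::io::Result<String> {
    let png = format!("/tmp/page_{page}.png");

    // ImageMagick: select the page, rasterize at 150 DPI, remove transparency.
    let status = Command::new("convert")
        .arg("-density").arg("150")
        .arg(format!("{pdf_path}[{page}]"))
        .arg("-background").arg("white")
        .arg("-alpha").arg("remove")
        .arg(&png)
        .status()?;
    if !status.success() {
        return Err(Error::new(ErrorKind::Other, "ImageMagick conversion failed"));
    }

    // ocrs: run text recognition on the rendered image and capture stdout.
    let out = Command::new("ocrs").arg(&png).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```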
The server implements intelligent request routing and security.
Security Middleware:
let auth = headers.get("authorization")
.and_then(|value| value.to_str().ok());
if auth.is_none() || !auth.unwrap().starts_with("Bearer ") {
return Err(StatusCode::UNAUTHORIZED);
}- URL-to-Filename Generation: Intelligently detects file types from URLs.
- Special Endpoint Handling: Dedicated logic for handling endpoints in documents.
- File Existence Checking: Avoids redundant downloads by checking for existing vectors in the database first.
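One illustrative take on the URL-to-filename step (the real `generate_filename_from_url` may use different rules) is to hash the URL for a stable name and keep a whitelisted extension from the path:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative: hash the URL for a stable local filename and keep a recognized
// extension from the URL path; the fallback to "pdf" is an assumption, not
// necessarily what server.rs does.
fn generate_filename_from_url(url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);

    // Look at the path portion only, ignoring any query string.
    let path = url.split('?').next().unwrap_or(url);
    let ext = path
        .rsplit('.')
        .next()
        .filter(|e| matches!(*e, "pdf" | "docx" | "xlsx" | "pptx" | "jpeg" | "png" | "txt"))
        .unwrap_or("pdf");

    format!("{:x}.{}", hasher.finish(), ext)
}
```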
Advanced Features:
- Final Challenge Detection: Special handling for contest-specific files.
- Error Response Standardization: Returns errors in a consistent JSON format.
- Performance Monitoring: Includes request timing and logging for observability.
This module provides a user-friendly, menu-driven interface for managing the server.
Menu-Driven Architecture:
- Graceful Shutdown: Handles `Ctrl+C` for proper cleanup before exiting.
- Server Management: Allows starting and stopping the server with status monitoring.
- Error Recovery: Robustly handles invalid user input without crashing.
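A minimal sketch of such a menu loop; the actual menu items and wiring in `main.rs` differ, but the error-recovery idea is the same:

```rust
use std::io::{self, Write};

// Minimal menu loop: invalid input is reported and the menu is shown again
// instead of crashing.
fn run_menu() -> io::Result<()> {
    loop {
        println!("1) Start server\n2) Show status\n3) Exit");
        print!("> ");
        io::stdout().flush()?;

        let mut choice = String::new();
        io::stdin().read_line(&mut choice)?;

        match choice.trim() {
            "1" => println!("starting server..."),   // would spawn the Axum server
            "2" => println!("status: placeholder"),
            "3" => return Ok(()),
            other => println!("unrecognized option: {other:?}"), // error recovery
        }
    }
}
```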
Tokio Runtime Utilization:
```rust
tokio::task::spawn_blocking(move || extract_file_text_sync(&file_path)).await?
```
Concurrency Patterns:
- Stream Processing: Uses `buffer_unordered(PARALLEL_REQS)` for high-throughput, parallel stream processing.
- Future Composition: Employs `tokio::select!` for gracefully handling multiple asynchronous operations, such as a task and a shutdown signal.
- Blocking Task Spawning: Correctly offloads CPU-bound work to a dedicated thread pool to avoid blocking the async runtime.
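The `tokio::select!` pattern from the list above looks roughly like this; `serve_forever` is a stand-in for the real server future:

```rust
use tokio::signal;

// Run a long-lived task but bail out cleanly when Ctrl+C arrives.
async fn run_with_shutdown() {
    tokio::select! {
        _ = serve_forever() => {
            println!("server task finished");
        }
        _ = signal::ctrl_c() => {
            println!("Ctrl+C received, shutting down gracefully");
            // cleanup of temp files and connections happens here / via RAII drops
        }
    }
}

async fn serve_forever() {
    // stand-in for the Axum server future
    std::future::pending::<()>().await;
}
```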
Connection Pool Management:
```rust
use mysql::{Opts, Pool};
use once_cell::sync::Lazy;

static DB_POOL: Lazy<Pool> = Lazy::new(|| {
    let database_url = std::env::var("MYSQL_CONNECTION").expect("MYSQL_CONNECTION not set");
    let opts = Opts::from_url(&database_url).expect("Invalid database URL");
    Pool::new(opts).expect("Failed to create database pool")
});
```
Performance Optimizations:
- Batch Insertions: Commits multiple embeddings in a single transaction for efficiency.
- Index Strategy: Uses dedicated indexes like `idx_pdf_filename` and `idx_chunk_index` for fast lookups.
- JSON Storage: Uses MySQL's native `JSON` data type for optimal embedding storage and retrieval.
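A sketch of how a batched insert can look with the `mysql` crate's `exec_batch`; the real `batch_store_pdf_embeddings` signature may differ, but the table and columns match the schema from the setup section:

```rust
use mysql::prelude::Queryable;
use mysql::{params, Pool};

// Batch-insert chunk embeddings in one round trip, storing each vector as JSON.
fn store_embeddings(
    pool: &Pool,
    pdf_filename: &str,
    chunks: &[(usize, String, Vec<f32>)],
) -> Result<(), Box<dyn std::error::Error>> {
    let mut conn = pool.get_conn()?;
    conn.exec_batch(
        r"INSERT INTO pdf_embeddings (pdf_filename, chunk_text, chunk_index, embedding)
          VALUES (:pdf_filename, :chunk_text, :chunk_index, :embedding)",
        chunks.iter().map(|(idx, text, vector)| {
            params! {
                "pdf_filename" => pdf_filename,
                "chunk_text" => text.as_str(),
                "chunk_index" => *idx as u32,
                "embedding" => serde_json::to_string(vector).unwrap(),
            }
        }),
    )?;
    Ok(())
}
```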
Rust Best Practices:
- RAII Pattern: Guarantees automatic cleanup of temporary files and other resources when they go out of scope.
- `Arc<T>`: Employs `Arc` for safe, shared ownership of data in parallel processing contexts.
- `Result<T, E>`: Uses comprehensive error propagation throughout the application for robust failure handling.
- `Option<T>`: Ensures null safety across the entire codebase.
- Input Sanitization: Defends against prompt injection attacks.
- File Type Validation: Uses a whitelist-based approach for processing file types.
- Payload Limits: Enforces request limits (e.g., 35 KB on embeddings) to stay within API limits. This limit can be removed for a large performance gain.
- SQL Injection Prevention: Exclusively uses parameterized queries to protect the database.
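The whitelist check, for instance, can be as small as the following (illustrative, not the exact code):

```rust
// Whitelist-based extension check: anything not on the list is rejected before
// it is downloaded or processed. Extensions taken from the support list above.
const ALLOWED_EXTENSIONS: &[&str] = &["pdf", "docx", "xlsx", "pptx", "jpeg", "png", "txt"];

fn is_supported_extension(ext: &str) -> bool {
    ALLOWED_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str())
}
```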
Graceful Degradation:
- Tool Fallbacks: Implements a chain of multiple OCR and file conversion tools to maximize success rates.
- File Recovery: Reuses existing files to recover from partial processing failures.
- API Resilience: Provides proper HTTP status codes and clear, standardized error messages.
- Concurrent Embeddings: Processes up to 50 parallel requests; this is, of course, limited by API rate limits, and removing the cap would improve performance greatly.
- Chunk Processing: Utilizes CPU-core optimized parallel processing for large PDFs.
- Database & Caching: Leverages connection pooling and file caching to maximize token use and be as efficient as possible.
- Relevance Filter: A `0.5` cosine similarity score is the minimum for context retrieval.
- Context Window: Uses the top 10 chunks to provide optimal context to the LLM. A larger context window increases accuracy even further.
- OCR Quality: Balances speed and accuracy with a `150 DPI` setting.
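Putting the two retrieval thresholds together, the selection step can be sketched as below, reusing the `cosine_similarity` function shown earlier; the chunk representation is simplified:

```rust
// Rank chunks by similarity to the query embedding, drop anything below the 0.5
// threshold, and keep the top 10 as context for the LLM.
fn select_context_chunks<'a>(
    query_embedding: &[f32],
    chunks: &'a [(String, Vec<f32>)],
) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = chunks
        .iter()
        .map(|(text, emb)| (cosine_similarity(query_embedding, emb), text.as_str()))
        .filter(|(score, _)| *score >= 0.5)
        .collect();

    // Highest similarity first; NaN scores were already removed by the threshold filter.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    scored.into_iter().take(10).map(|(_, text)| text).collect()
}
```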
- Stateless Design: Each request is independent, making it easy to scale and multithread.
- Observability: Includes comprehensive logging and timing measurements for every request.
- Configuration: All configuration is managed via environment variables for easy deployment.
- Resource Management: Temporary files are cleaned up automatically via the RAII pattern.
- API Standards: Adheres to RESTful design principles with proper HTTP semantics.
- Built in Rust: We chose Rust to make the API as fast as possible.
- Persistent Vector Store: The MySQL Database is perfect for company level usage of the system, where a document is queried constantly by both employees and clients.
- Handles all Documents: A chain of tools with fallbacks ensures that the system handles as many document types as possible.
- Context-Aware Embedding: Combines multiple questions into a single embedding for token efficiency.
- Prompt Injection Protection: Sanitizes all user-supplied content against known injection patterns before it reaches the LLM.
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
The following commands are for Debian/Ubuntu-based systems:
```bash
sudo apt-get update
sudo apt-get install pdftk-java qpdf poppler-utils libglib2.0-dev libcairo2-dev libpoppler-glib-dev bc libreoffice imagemagick
```
```bash
cargo install miniserve
cargo install ocrs-cli --locked
```
Create a `.env` file from the example:
```bash
cp .envexample .env
```
Create a MySQL database and run the following schema:
```sql
CREATE TABLE pdf_embeddings (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    pdf_filename VARCHAR(255) NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    embedding JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_pdf_filename (pdf_filename),
    INDEX idx_chunk_index (chunk_index)
);
```
Then, populate your `.env` file with the database connection string and your Gemini API key:
```
MYSQL_CONNECTION=mysql://username:password@localhost:3306/your_database
GEMINI_KEY=your_gemini_api_key
```
Run the server:
```bash
cargo run
```
The repository includes three scripts with various payloads to test the API with different document types:
```bash
./test.sh
./sim.sh
./simr4.sh
```
- Rust (latest stable)
- MySQL database
- Google Gemini API key
- System packages for document processing (listed in step 2)
- OCR tools for image text extraction (listed in step 3)