BrainDrive Document Processing Service is a FastAPI-based microservice for parsing and chunking documents. It serves as the document ingestion component of BrainDrive's Chat with Documents plugin, which uses a Retrieval-Augmented Generation (RAG) pipeline.
It can also run as a standalone document processing API. The service follows Clean Architecture principles, processes documents, and returns structured chunks with metadata. Built-in authentication supports secure deployment while keeping local development simple.
git clone https://github.com/BrainDriveAI/Document-Processing-Service.git
cd Document-Processing-Service
poetry install
Copy .env.local → .env:
cp .env.local .env
At minimum .env.local should contain:
# Disable authentication (ONLY for local development)
DISABLE_AUTH=false
# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key
⚠️ AUTH_API_KEY and JWT_SECRET are automatically generated by the generate_keys scripts.
Copy .env.production → .env and update values:
cp .env.production .env
For Linux/macOS:
chmod +x generate_keys.sh
./generate_keys.sh --both
For Windows (PowerShell):
.\generate_keys.ps1 -Both
poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# Without authentication (if auth disabled)
curl -X POST "http://localhost:8000/documents/upload" \
-F "file=@your-document.pdf"
# With API key (if auth enabled)
curl -X POST "http://localhost:8000/documents/upload" \
-H "X-API-Key: your-generated-api-key" \
-F "file=@your-document.pdf"
- API Key: Simple header-based authentication
- JWT: Token-based authentication for integration
- Disabled: For local development only
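A minimal sketch of how a client might pick request headers for each method. The header names come from the curl examples in this README (X-API-Key for API keys, Authorization: Bearer for JWTs). The helper name build_auth_headers is hypothetical and is not part of the service.

```python
from typing import Dict, Optional


def build_auth_headers(method: str, credential: Optional[str] = None) -> Dict[str, str]:
    """Return HTTP headers for the given AUTH_METHOD value (hypothetical helper)."""
    method = method.strip().lower()
    if method == "disabled":
        # No credentials are sent when authentication is turned off.
        return {}
    if method not in ("api_key", "jwt"):
        raise ValueError(f"Unknown auth method: {method!r}")
    if not credential:
        raise ValueError(f"AUTH_METHOD={method} requires a credential")
    if method == "api_key":
        return {"X-API-Key": credential}
    # JWTs are sent as a bearer token.
    return {"Authorization": f"Bearer {credential}"}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.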
# Make script executable
chmod +x generate_keys.sh
# Generate both API key and JWT secret
./generate_keys.sh --both
# Generate only API key
./generate_keys.sh --api-key
# Generate only JWT secret
./generate_keys.sh --jwt-secret
# Custom options
./generate_keys.sh --both --length 128 --env-file .env.production
# Generate both API key and JWT secret
.\generate_keys.ps1 -Both
# Generate only API key
.\generate_keys.ps1 -ApiKey
# Generate only JWT secret
.\generate_keys.ps1 -JwtSecret
# Custom options
.\generate_keys.ps1 -Both -Length 128 -EnvFile .env.production
- ✅ Use DISABLE_AUTH=false in production or whenever other services / clients call the API
- ✅ Store API keys securely (environment variables)
- ✅ Use different keys for development and production
- ✅ Rotate keys periodically
- ❌ Never commit .env files to version control
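As one way to follow the advice on keeping keys in environment variables, a client can read AUTH_API_KEY at startup instead of hardcoding it. This is a sketch only: load_api_key is a hypothetical helper, and it assumes the client uses the same variable name as the service.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read AUTH_API_KEY from the environment rather than from source code."""
    env = os.environ if env is None else env
    # Strip stray whitespace that often sneaks in when copying keys around.
    key = env.get("AUTH_API_KEY", "").strip()
    if not key:
        raise RuntimeError("AUTH_API_KEY is not set; export it before starting the client")
    return key
```

Passing a plain dictionary as env makes the helper easy to exercise in tests without touching the real environment.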
- .env.local: minimal template for local development
- .env.production: full template for production
Example .env.local:
DISABLE_AUTH=false
AUTH_METHOD=api_key
DEBUG=true
LOG_LEVEL=DEBUG
Example .env.production:
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-your-production-api-key
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080
UPLOADS_DIR=/app/data/uploads
# Disable authentication (ONLY for local development)
DISABLE_AUTH=false
# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key
# API Key Authentication
AUTH_API_KEY=sk-your-generated-api-key-here
# JWT Authentication (alternative to API key)
JWT_SECRET=your-generated-jwt-secret-here
JWT_ALGORITHM=HS256
JWT_EXPIRE_MINUTES=60
# Core App Settings
DEBUG=false
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO
# File Processing
UPLOADS_DIR=data/uploads
UPLOAD_MAX_FILE_SIZE=104857600 # 100MB
UPLOAD_MAX_PART_SIZE=52428800 # 50MB
# Document Processing
SPACY_MODEL=en_core_web_sm
# Performance
MAX_CONCURRENT_PROCESSES=4
PROCESSING_TIMEOUT=300
# Disable auth for local development
DISABLE_AUTH=true
DEBUG=true
LOG_LEVEL=DEBUG
UPLOADS_DIR=data/uploads
# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p7q8r9s0t1u2v3w4x5y6z
# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8000
# File processing
UPLOADS_DIR=/app/uploads
UPLOAD_MAX_FILE_SIZE=104857600
# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-prod-your-secure-api-key-here
# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080
# Cloud-specific paths
UPLOADS_DIR=/tmp/uploads
SPACY_MODEL=en_core_web_sm
# Performance tuning for cloud
MAX_CONCURRENT_PROCESSES=2
PROCESSING_TIMEOUT=180
Document-Processing-Service/
├── app/  # Application code
│   ├── core/  # Domain layer (business logic)
│   │   ├── domain/  # Business entities, value objects, exceptions
│   │   ├── ports/  # Interfaces/contracts
│   │   └── use_cases/  # Application business rules
│   ├── adapters/  # External service implementations
│   │   ├── document_processor/  # Document processing adapters
│   │   ├── token_service/  # Token counting services
│   │   └── auth_service/  # Authentication implementations
│   ├── api/  # API layer (routes, dependencies)
│   ├── infrastructure/  # Cross-cutting concerns (logging, metrics)
│   ├── config.py  # Configuration management
│   └── main.py  # Application entry point
│
├── docs/  # Documentation
│   ├── OWNERS-MANUAL.md  # Complete operations guide (40KB+)
│   ├── AI-AGENT-GUIDE.md  # Guide for AI coding agents
│   ├── README.md  # Documentation overview
│   ├── COMPOUNDING-SETUP-COMPLETE.md  # Verification checklist
│   ├── decisions/  # Architecture Decision Records (ADRs)
│   │   ├── 000-template.md  # ADR template
│   │   └── 001-docling-for-document-processing.md
│   ├── failures/  # Lessons learned (what NOT to do)
│   ├── data-quirks/  # Non-obvious system behavior
│   │   ├── 001-windows-huggingface-symlinks.md
│   │   ├── 002-git-commit-co-author.md
│   │   └── 003-git-commit-message-format.md
│   ├── integrations/  # External service documentation
│   │   └── docling.md  # Docling integration reference
│   └── docker_deployment.md  # Docker deployment guide
│
├── tests/  # Tests
│   ├── document-upload.py  # Manual API test client
│   ├── document-upload.http  # HTTP test file
│   └── file_samples/  # Sample documents for testing
│
├── FOR-AI-CODING-AGENTS.md  # Architecture guide for AI agents
├── docker-compose.yml  # Docker Compose configuration
├── docker-compose.prod.yml  # Production Docker Compose
├── Dockerfile  # Docker image definition
├── generate_keys.sh  # Key generation (Linux/macOS)
├── generate_keys.ps1  # Key generation (Windows)
├── pyproject.toml  # Poetry dependencies
├── .env.local  # Local dev environment template
├── .env.production  # Production environment template
└── README.md  # This file
For Developers:
- Start with FOR-AI-CODING-AGENTS.md for architecture and patterns
- Check docs/decisions/ for past architectural decisions
For Operators:
- Read docs/OWNERS-MANUAL.md for the complete operations guide
- Covers deployment, monitoring, troubleshooting, scaling, disaster recovery
For AI Agents:
- See FOR-AI-CODING-AGENTS.md for the architecture guide
- See docs/AI-AGENT-GUIDE.md for the compounding engineering workflow
- Search docs/ for past decisions, failures, quirks before implementing
- Domain Layer: Pure business logic without external dependencies
- Ports: Interfaces defining contracts
- Adapters: Implementations of external services
- Use Cases: Application-specific business rules
- Multi-format Support: PDF, DOCX, DOC, TXT
- Advanced Chunking: Hierarchical, semantic, token-based, adaptive
- Metadata Extraction: Rich metadata from document structure
- spaCy Layout Integration: Sophisticated document understanding
- Authentication: API Key and JWT support
- RESTful Endpoints: Clean, intuitive API design
- Async Processing: Handle large documents efficiently
- Error Handling: Comprehensive error reporting
- Plugin Architecture: Easy to add new processors
- Strategy Pattern: Swappable chunking strategies
- Configuration-driven: Flexible processing parameters
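To illustrate what "swappable chunking strategies" can look like in practice, here is a small sketch of the Strategy pattern. It is not the service's actual code: ChunkingStrategy, FixedSizeChunker, ParagraphChunker, and chunk_document are hypothetical names, and the two strategies are simple stand-ins for the hierarchical, semantic, token-based, and adaptive strategies listed above.

```python
from typing import List, Protocol


class ChunkingStrategy(Protocol):
    """Interface every strategy implements (plays the role of a port)."""

    def chunk(self, text: str) -> List[str]:
        ...


class FixedSizeChunker:
    """Splits text into chunks of at most max_words whitespace-separated words."""

    def __init__(self, max_words: int = 50) -> None:
        if max_words < 1:
            raise ValueError("max_words must be at least 1")
        self.max_words = max_words

    def chunk(self, text: str) -> List[str]:
        words = text.split()
        step = self.max_words
        return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]


class ParagraphChunker:
    """Splits text on blank lines, producing one chunk per paragraph."""

    def chunk(self, text: str) -> List[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]


def chunk_document(text: str, strategy: ChunkingStrategy) -> List[str]:
    """The caller depends only on the interface, so strategies are interchangeable."""
    return strategy.chunk(text)
```

Because chunk_document only knows about the interface, adding a new strategy means writing one more class with a chunk method and selecting it through configuration.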
GET / # Service info (includes auth status)
GET /health # Health check (no auth required)
GET /docs # API documentation (no auth required)
GET /metrics # Prometheus metrics (no auth required)
POST /documents/upload # Process document and return chunks
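Since /health needs no authentication, it is a convenient readiness probe for scripts that start the service and then upload documents. The sketch below polls it using only the standard library. The function names are hypothetical, and it assumes a healthy service answers with HTTP 200, which this README does not state explicitly.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a status code.
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_healthy(
    base_url: str,
    attempts: int = 10,
    delay: float = 1.0,
    get_status: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll GET /health until it returns 200 (assumed) or attempts run out."""
    url = base_url.rstrip("/") + "/health"
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deployment script might call wait_until_healthy("http://localhost:8000") and only proceed to POST /documents/upload once it returns True.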
curl -X POST "http://localhost:8000/documents/upload" \
-H "X-API-Key: sk-your-api-key-here" \
-F "file=@document.pdf"
curl -X POST "http://localhost:8000/documents/upload" \
-H "Authorization: Bearer your-jwt-token-here" \
-F "file=@document.pdf"
{
  "document_id": "uuid-here",
  "chunks": [
    {
      "id": "chunk-1",
      "text": "Document content...",
      "metadata": {
        "page": 1,
        "position": 0,
        "token_count": 150
      }
    }
  ],
  "metadata": {
    "total_chunks": 5,
    "processing_time": 2.34,
    "document_type": "pdf"
  }
}
# Build image
docker build -t braindrive-document-ai .
# Run with environment variables
docker run -p 8000:8000 \
-e DISABLE_AUTH=false \
-e AUTH_API_KEY=your-api-key \
braindrive-document-ai
version: '3.8'
services:
  document-ai:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DISABLE_AUTH=false
      - AUTH_API_KEY=sk-your-generated-api-key
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data
import aiohttp
import asyncio
import os


class DocumentProcessingClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key

    async def process_document(self, file_path: str) -> dict:
        async with aiohttp.ClientSession() as session:
            # Keep the file open until the upload request has completed
            with open(file_path, 'rb') as f:
                form = aiohttp.FormData()
                form.add_field('file', f, filename=os.path.basename(file_path))
                headers = {'X-API-Key': self.api_key}
                async with session.post(
                    f"{self.base_url}/documents/upload",
                    data=form,
                    headers=headers
                ) as response:
                    return await response.json()


# Usage
async def main():
    client = DocumentProcessingClient(
        base_url="http://localhost:8000",
        api_key="sk-your-api-key"
    )
    # await is only valid inside a coroutine, so the call lives in main()
    result = await client.process_document("document.pdf")
    print(result)


asyncio.run(main())
# Build and push
gcloud builds submit --tag gcr.io/your-project/braindrive-document-ai
# Deploy
gcloud run deploy braindrive-document-ai \
--image gcr.io/your-project/braindrive-document-ai \
--platform managed \
--region us-central1 \
--set-env-vars DISABLE_AUTH=false,AUTH_API_KEY=your-secure-key
from mangum import Mangum
from app.main import app
handler = Mangum(app)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: braindrive-document-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: braindrive-document-ai
  template:
    metadata:
      # Pod labels must match spec.selector.matchLabels
      labels:
        app: braindrive-document-ai
    spec:
      containers:
        - name: braindrive-document-ai
          image: your-registry/braindrive-document-ai:latest
          env:
            - name: DISABLE_AUTH
              value: "false"
            - name: AUTH_API_KEY
              valueFrom:
                secretKeyRef:
                  name: braindrive-secrets
                  key: api-key
# Install dev dependencies
poetry install --with dev
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=app
# Run specific test types
poetry run pytest tests/unit/
poetry run pytest tests/integration/
# Test without auth (should fail in production)
curl -X POST "http://localhost:8000/documents/upload" \
-F "file=@test.pdf"
# Test with valid API key
curl -X POST "http://localhost:8000/documents/upload" \
-H "X-API-Key: your-api-key" \
-F "file=@test.pdf"
# Test with invalid API key (should return 401)
curl -X POST "http://localhost:8000/documents/upload" \
-H "X-API-Key: invalid-key" \
-F "file=@test.pdf"
- JSON-formatted logs for easy parsing
- Request/response correlation IDs
- Authentication audit logs
- Performance metrics logging
# Access metrics endpoint
curl http://localhost:8000/metrics
# Health check (always accessible)
curl http://localhost:8000/health
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make changes following clean architecture principles
- Add tests for new functionality
- Ensure new feature works correctly
- Update documentation as needed
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Authentication Errors (401)
- Verify DISABLE_AUTH=false is set correctly
- Check that AUTH_API_KEY is set in the environment
- Ensure the client is sending the X-API-Key header
- Verify the API key matches exactly (no extra spaces)
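Stray whitespace is easy to miss when a key is copied from a terminal or an .env file. The following sketch shows one way to check a key on the client side before sending it. diagnose_api_key is a hypothetical helper, and the checks only cover the whitespace issue mentioned above.

```python
from typing import List


def diagnose_api_key(key: str) -> List[str]:
    """Return a list of likely copy/paste problems with an API key."""
    if not key:
        return ["key is empty"]
    problems: List[str] = []
    stripped = key.strip()
    if key != stripped:
        problems.append("key has leading or trailing whitespace")
    # Whitespace in the middle usually means the key was pasted across lines.
    if any(ch.isspace() for ch in stripped):
        problems.append("key contains embedded whitespace")
    return problems
```

An empty list means none of these problems were found; it does not prove the key matches the one configured on the server.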
Key Generation Issues
- Linux/macOS: Ensure openssl is installed or /dev/urandom exists
- Windows: Run PowerShell as Administrator if needed
- Check script permissions: chmod +x generate_keys.sh
File Upload Errors
- Check file size limits (UPLOAD_MAX_FILE_SIZE)
- Verify file format is supported (PDF, DOCX, DOC, TXT)
- Ensure UPLOADS_DIR is writable
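The first two checks can also be run on the client before uploading, which avoids a round trip for files the service would likely reject. This is a sketch under stated assumptions: preflight_upload_check is a hypothetical helper, the default limit mirrors the UPLOAD_MAX_FILE_SIZE=104857600 example in this README, and whether the server treats the limit as inclusive is not documented here.

```python
import os
from typing import List

# Formats listed in this README; the extension check is a client-side guess only.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}
DEFAULT_MAX_FILE_SIZE = 104857600  # 100 MB, matching the example configuration


def preflight_upload_check(path: str, max_size: int = DEFAULT_MAX_FILE_SIZE) -> List[str]:
    """Return reasons an upload would probably fail; an empty list means none found."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    problems: List[str] = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > max_size:
        problems.append(f"file is {size} bytes, limit is {max_size}")
    return problems
```

If your deployment overrides UPLOAD_MAX_FILE_SIZE, pass the same value as max_size so the client and server agree.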
spaCy Model Issues
# Install required spaCy model
python -m spacy download en_core_web_sm
DEBUG=true
LOG_LEVEL=DEBUG
This will provide detailed logging for troubleshooting issues.
Start here: FOR-AI-CODING-AGENTS.md
This project has comprehensive documentation for AI coding agents (Claude Code, GitHub Copilot, Cursor, Codeium, etc.):
- Architecture Guide: Read FOR-AI-CODING-AGENTS.md for Clean Architecture patterns, development commands, and implementation details
- Knowledge Base: Check docs/ before implementing; it contains ADRs, failures, data quirks, and integration docs
- Auto-Compound: Document your decisions, failures, and discoveries in docs/ for future developers
Quick commands:
# Search past decisions
grep -r "keyword" docs/decisions/
# Check known mistakes
grep -r "keyword" docs/failures/
# Review quirks/gotchas
grep -r "keyword" docs/data-quirks/
See: docs/AI-AGENT-GUIDE.md for the complete compounding engineering workflow.
- BrainDrive Main Application - The primary chat application
- BrainDrive Chat With Your Documents Service - Sophisticated RAG