BrainDrive Document Processing Service

BrainDrive Document Processing Service is a FastAPI-based microservice for parsing and chunking documents. It serves as the document ingestion component of BrainDrive's Chat with Documents plugin, which uses a Retrieval-Augmented Generation (RAG) pipeline.

The service can also run as a standalone document processing API. It follows Clean Architecture principles, processes documents, and returns structured chunks with metadata. Built-in authentication secures deployments while keeping local development simple.

🚀 Quick Start

1. Clone and Install

git clone https://github.com/BrainDriveAI/Document-Processing-Service.git
cd Document-Processing-Service
poetry install

2. Configure Environment

Local Development

Copy .env.local → .env:

cp .env.local .env

At minimum, .env.local should contain:

# Set to true to disable authentication (ONLY for local development)
DISABLE_AUTH=false

# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key

⚠️ AUTH_API_KEY and JWT_SECRET are automatically generated by the generate_keys scripts.

Production Deployment

Copy .env.production → .env and update values:

cp .env.production .env

3. Generate API Keys

For Linux/macOS:

chmod +x generate_keys.sh
./generate_keys.sh --both

For Windows (PowerShell):

.\generate_keys.ps1 -Both

4. Run the Service

poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

5. Test the API

# Without authentication (if auth disabled)
curl -X POST "http://localhost:8000/documents/upload" \
  -F "file=@your-document.pdf"

# With API key (if auth enabled)
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: your-generated-api-key" \
  -F "file=@your-document.pdf"

🔐 Authentication & Security

Authentication Methods

API Key: Simple header-based authentication
JWT: Token-based authentication for integration
Disabled: For local development only
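
From the client's side, the three methods differ only in which header is sent. The sketch below uses the header names shown in the API usage examples in this README; `build_auth_headers` is a hypothetical helper for illustration, not part of the service.

```python
def build_auth_headers(method: str, credential: str = "") -> dict[str, str]:
    """Return the HTTP headers a client would send for a given AUTH_METHOD."""
    if method == "api_key":
        # API key auth: the key travels in the X-API-Key header
        return {"X-API-Key": credential}
    if method == "jwt":
        # JWT auth: the token travels as a Bearer token
        return {"Authorization": f"Bearer {credential}"}
    if method == "disabled":
        # No auth headers are needed when authentication is disabled
        return {}
    raise ValueError(f"Unknown auth method: {method!r}")


headers = build_auth_headers("api_key", "sk-your-api-key-here")
```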

Auth Key Generation

Linux/macOS Users

# Make script executable
chmod +x generate_keys.sh

# Generate both API key and JWT secret
./generate_keys.sh --both

# Generate only API key
./generate_keys.sh --api-key

# Generate only JWT secret
./generate_keys.sh --jwt-secret

# Custom options
./generate_keys.sh --both --length 128 --env-file .env.production

Windows Users (PowerShell)

# Generate both API key and JWT secret
.\generate_keys.ps1 -Both

# Generate only API key
.\generate_keys.ps1 -ApiKey

# Generate only JWT secret
.\generate_keys.ps1 -JwtSecret

# Custom options
.\generate_keys.ps1 -Both -Length 128 -EnvFile .env.production
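
If neither script can run in your environment, Python's standard library can produce values of a similar shape. This is only a sketch and does not reproduce what the generate_keys scripts do internally: the `sk-` prefix is copied from the example keys in this README, and the function names and default lengths are assumptions.

```python
import secrets


def generate_api_key(num_bytes: int = 32, prefix: str = "sk-") -> str:
    """Return a random URL-safe key with an assumed 'sk-' prefix."""
    return prefix + secrets.token_urlsafe(num_bytes)


def generate_jwt_secret(num_bytes: int = 64) -> str:
    """Return a random hex string suitable as an HMAC signing secret."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Print lines that can be pasted into a .env file
    print(f"AUTH_API_KEY={generate_api_key()}")
    print(f"JWT_SECRET={generate_jwt_secret()}")
```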

Security Best Practices

✅ Keep DISABLE_AUTH=false in production and whenever other services or clients call the API
✅ Store API keys securely (environment variables)
✅ Use different keys for development and production
✅ Rotate keys periodically
✅ Never commit .env files to version control

⚙️ Configuration

Environment Variables

.env.local → minimal template for local development
.env.production → full template for production

Example .env.local:

DISABLE_AUTH=false
AUTH_METHOD=api_key
DEBUG=true
LOG_LEVEL=DEBUG

Example .env.production:

DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-your-production-api-key
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080
UPLOADS_DIR=/app/data/uploads

Authentication (Production Required)

# Set to true to disable authentication (ONLY for local development)
DISABLE_AUTH=false

# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key

# API Key Authentication
AUTH_API_KEY=sk-your-generated-api-key-here

# JWT Authentication (alternative to API key)
JWT_SECRET=your-generated-jwt-secret-here
JWT_ALGORITHM=HS256
JWT_EXPIRE_MINUTES=60

Application Settings

# Core App Settings
DEBUG=false
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO

# File Processing
UPLOADS_DIR=data/uploads
UPLOAD_MAX_FILE_SIZE=104857600  # 100MB
UPLOAD_MAX_PART_SIZE=52428800   # 50MB

# Document Processing
SPACY_MODEL=en_core_web_sm

# Performance
MAX_CONCURRENT_PROCESSES=4
PROCESSING_TIMEOUT=300
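
As an illustration of how these variables map to typed values, the sketch below reads them from the environment with the defaults listed above. It is not the project's `app/config.py`; the `Settings` class and `load_settings` function are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    debug: bool
    api_host: str
    api_port: int
    log_level: str
    uploads_dir: str
    upload_max_file_size: int
    max_concurrent_processes: int
    processing_timeout: int


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        debug=_as_bool(env.get("DEBUG", "false")),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        uploads_dir=env.get("UPLOADS_DIR", "data/uploads"),
        # 100 MB expressed in bytes: 100 * 1024 * 1024 = 104857600
        upload_max_file_size=int(env.get("UPLOAD_MAX_FILE_SIZE", str(100 * 1024 * 1024))),
        max_concurrent_processes=int(env.get("MAX_CONCURRENT_PROCESSES", "4")),
        processing_timeout=int(env.get("PROCESSING_TIMEOUT", "300")),
    )


settings = load_settings()
```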

Configuration Examples

Local Development (`.env`)

# Disable auth for local development
DISABLE_AUTH=true
DEBUG=true
LOG_LEVEL=DEBUG
UPLOADS_DIR=data/uploads

Production with API Key (`.env`)

# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p7q8r9s0t1u2v3w4x5y6z

# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8000

# File processing
UPLOADS_DIR=/app/uploads
UPLOAD_MAX_FILE_SIZE=104857600

Docker/Cloud Deployment (`.env`)

# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-prod-your-secure-api-key-here

# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080

# Cloud-specific paths
UPLOADS_DIR=/tmp/uploads
SPACY_MODEL=en_core_web_sm

# Performance tuning for cloud
MAX_CONCURRENT_PROCESSES=2
PROCESSING_TIMEOUT=180

📊 Project Structure

Document-Processing-Service/
├── app/                               # Application code
│   ├── core/                          # Domain layer (business logic)
│   │   ├── domain/                    # Business entities, value objects, exceptions
│   │   ├── ports/                     # Interfaces/contracts
│   │   └── use_cases/                 # Application business rules
│   ├── adapters/                      # External service implementations
│   │   ├── document_processor/        # Document processing adapters
│   │   ├── token_service/             # Token counting services
│   │   └── auth_service/              # Authentication implementations
│   ├── api/                           # API layer (routes, dependencies)
│   ├── infrastructure/                # Cross-cutting concerns (logging, metrics)
│   ├── config.py                      # Configuration management
│   └── main.py                        # Application entry point
│
├── docs/                              # Documentation
│   ├── OWNERS-MANUAL.md              # Complete operations guide (40KB+)
│   ├── AI-AGENT-GUIDE.md             # Guide for AI coding agents
│   ├── README.md                      # Documentation overview
│   ├── COMPOUNDING-SETUP-COMPLETE.md  # Verification checklist
│   ├── decisions/                     # Architecture Decision Records (ADRs)
│   │   ├── 000-template.md           # ADR template
│   │   └── 001-docling-for-document-processing.md
│   ├── failures/                      # Lessons learned (what NOT to do)
│   ├── data-quirks/                   # Non-obvious system behavior
│   │   ├── 001-windows-huggingface-symlinks.md
│   │   ├── 002-git-commit-co-author.md
│   │   └── 003-git-commit-message-format.md
│   ├── integrations/                  # External service documentation
│   │   └── docling.md                # Docling integration reference
│   └── docker_deployment.md           # Docker deployment guide
│
├── tests/                             # Tests
│   ├── document-upload.py            # Manual API test client
│   ├── document-upload.http          # HTTP test file
│   └── file_samples/                 # Sample documents for testing
│
├── FOR-AI-CODING-AGENTS.md           # Architecture guide for AI agents
├── docker-compose.yml                 # Docker Compose configuration
├── docker-compose.prod.yml            # Production Docker Compose
├── Dockerfile                         # Docker image definition
├── generate_keys.sh                   # Key generation (Linux/macOS)
├── generate_keys.ps1                  # Key generation (Windows)
├── pyproject.toml                     # Poetry dependencies
├── .env.local                         # Local dev environment template
├── .env.production                    # Production environment template
└── README.md                          # This file

📚 Documentation Guide

For Developers:

Start with FOR-AI-CODING-AGENTS.md for architecture and patterns
Check docs/decisions/ for past architectural decisions

For Operators:

Read docs/OWNERS-MANUAL.md for complete operations guide
Covers deployment, monitoring, troubleshooting, scaling, disaster recovery

For AI Agents:

See FOR-AI-CODING-AGENTS.md for architecture guide
See docs/AI-AGENT-GUIDE.md for compounding engineering workflow
Search docs/ for past decisions, failures, quirks before implementing

🏗️ Key Features

1. Clean Architecture Compliance

Domain Layer: Pure business logic without external dependencies
Ports: Interfaces defining contracts
Adapters: Implementations of external services
Use Cases: Application-specific business rules
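
The relationship between these layers can be sketched in a few lines. The names below (`TokenCounter`, `WhitespaceTokenCounter`, `summarize_chunk`) are hypothetical and only illustrate the pattern; the real ports and adapters live under app/core/ports/ and app/adapters/.

```python
from typing import Protocol


class TokenCounter(Protocol):
    """Port: the contract the core layer depends on."""

    def count(self, text: str) -> int: ...


class WhitespaceTokenCounter:
    """Adapter: one possible implementation of the port."""

    def count(self, text: str) -> int:
        return len(text.split())


def summarize_chunk(text: str, counter: TokenCounter) -> dict:
    """Use case: depends only on the port, never on a concrete adapter."""
    return {"text": text, "token_count": counter.count(text)}


summary = summarize_chunk("Clean Architecture keeps the core independent",
                          WhitespaceTokenCounter())
```

Because the use case receives the port as a parameter, a different adapter (for example, one backed by a real tokenizer) can be swapped in without touching the core.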

2. Document Processing Capabilities

Multi-format Support: PDF, DOCX, DOC, TXT
Advanced Chunking: Hierarchical, semantic, token-based, adaptive
Metadata Extraction: Rich metadata from document structure
spaCy Layout Integration: Sophisticated document understanding

3. Secure API Design

Authentication: API Key and JWT support
RESTful Endpoints: Clean, intuitive API design
Async Processing: Handle large documents efficiently
Error Handling: Comprehensive error reporting

4. Extensibility

Plugin Architecture: Easy to add new processors
Strategy Pattern: Swappable chunking strategies
Configuration-driven: Flexible processing parameters
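
A minimal sketch of the strategy pattern as it might apply to chunking. The two strategies here are deliberately simple stand-ins with hypothetical names; the service's hierarchical, semantic, token-based, and adaptive strategies are more involved.

```python
from typing import Callable

# A strategy is any callable that turns text into a list of chunks
ChunkStrategy = Callable[[str], list[str]]


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_by_words(text: str, max_words: int = 50) -> list[str]:
    """Split into consecutive groups of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Registry keyed by a configuration value, so the strategy is swappable
STRATEGIES: dict[str, ChunkStrategy] = {
    "paragraph": chunk_by_paragraph,
    "words": chunk_by_words,
}


def chunk_document(text: str, strategy: str = "paragraph") -> list[str]:
    """Look up the configured strategy and apply it."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown chunking strategy: {strategy!r}")
    return STRATEGIES[strategy](text)


chunks = chunk_document("First paragraph.\n\nSecond paragraph.")
```

Adding a new strategy then means registering one more function, with no change to the caller.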

🌐 API Endpoints

Authentication Status

GET /                    # Service info (includes auth status)
GET /health             # Health check (no auth required)
GET /docs               # API documentation (no auth required)
GET /metrics            # Prometheus metrics (no auth required)

Document Processing (Authentication Required)

POST /documents/upload  # Process document and return chunks

API Usage Examples

With API Key

curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: sk-your-api-key-here" \
  -F "file=@document.pdf"

With JWT Token

curl -X POST "http://localhost:8000/documents/upload" \
  -H "Authorization: Bearer your-jwt-token-here" \
  -F "file=@document.pdf"

Response Format

{
  "document_id": "uuid-here",
  "chunks": [
    {
      "id": "chunk-1",
      "text": "Document content...",
      "metadata": {
        "page": 1,
        "position": 0,
        "token_count": 150
      }
    }
  ],
  "metadata": {
    "total_chunks": 5,
    "processing_time": 2.34,
    "document_type": "pdf"
  }
}
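
A short sketch of consuming this response on the client side. It assumes only the fields shown in the sample above; `total_tokens` and `chunks_on_page` are hypothetical helper names.

```python
# The sample response from above, as a Python dict
sample_response = {
    "document_id": "uuid-here",
    "chunks": [
        {
            "id": "chunk-1",
            "text": "Document content...",
            "metadata": {"page": 1, "position": 0, "token_count": 150},
        }
    ],
    "metadata": {"total_chunks": 5, "processing_time": 2.34, "document_type": "pdf"},
}


def total_tokens(response: dict) -> int:
    """Sum token_count over all returned chunks."""
    return sum(chunk["metadata"]["token_count"] for chunk in response["chunks"])


def chunks_on_page(response: dict, page: int) -> list[str]:
    """Return the ids of chunks whose metadata places them on the given page."""
    return [c["id"] for c in response["chunks"] if c["metadata"]["page"] == page]


tokens = total_tokens(sample_response)
first_page_ids = chunks_on_page(sample_response, 1)
```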

🐳 Docker Support

Build and Run

# Build image
docker build -t braindrive-document-ai .

# Run with environment variables
docker run -p 8000:8000 \
  -e DISABLE_AUTH=false \
  -e AUTH_API_KEY=your-api-key \
  braindrive-document-ai

Docker Compose

version: '3.8'
services:
  document-ai:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DISABLE_AUTH=false
      - AUTH_API_KEY=sk-your-generated-api-key
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data

🔧 Client Integration

Python Client Example

import asyncio
import os

import aiohttp


class DocumentProcessingClient:
    """Minimal async client for the /documents/upload endpoint."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key

    async def process_document(self, file_path: str) -> dict:
        headers = {'X-API-Key': self.api_key}
        async with aiohttp.ClientSession() as session:
            # Keep the file open until the upload has completed
            with open(file_path, 'rb') as f:
                form = aiohttp.FormData()
                form.add_field('file', f, filename=os.path.basename(file_path))

                async with session.post(
                    f"{self.base_url}/documents/upload",
                    data=form,
                    headers=headers
                ) as response:
                    # Raise on 4xx/5xx responses (e.g. 401 for a bad key)
                    response.raise_for_status()
                    return await response.json()


async def main():
    client = DocumentProcessingClient(
        base_url="http://localhost:8000",
        api_key="sk-your-api-key"
    )
    result = await client.process_document("document.pdf")
    print(result)

# Usage: top-level await is not valid in a script, so run it with asyncio
asyncio.run(main())

🚀 Deployment

Google Cloud Run

# Build and push
gcloud builds submit --tag gcr.io/your-project/braindrive-document-ai

# Deploy
gcloud run deploy braindrive-document-ai \
  --image gcr.io/your-project/braindrive-document-ai \
  --platform managed \
  --region us-central1 \
  --set-env-vars DISABLE_AUTH=false,AUTH_API_KEY=your-secure-key

AWS Lambda (with Mangum)

from mangum import Mangum
from app.main import app

handler = Mangum(app)

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: braindrive-document-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: braindrive-document-ai
  template:
    metadata:
      labels:
        app: braindrive-document-ai
    spec:
      containers:
      - name: braindrive-document-ai
        image: your-registry/braindrive-document-ai:latest
        env:
        - name: DISABLE_AUTH
          value: "false"
        - name: AUTH_API_KEY
          valueFrom:
            secretKeyRef:
              name: braindrive-secrets
              key: api-key

🧪 Testing

Run Tests

# Install dev dependencies
poetry install --with dev

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=app

# Run specific test types
poetry run pytest tests/unit/
poetry run pytest tests/integration/

Test Authentication

# Test without auth (should be rejected when authentication is enabled)
curl -X POST "http://localhost:8000/documents/upload" \
  -F "file=@test.pdf"

# Test with valid API key
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: your-api-key" \
  -F "file=@test.pdf"

# Test with invalid API key (should return 401)
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: invalid-key" \
  -F "file=@test.pdf"

🔍 Monitoring & Observability

Structured Logging

JSON-formatted logs for easy parsing
Request/response correlation IDs
Authentication audit logs
Performance metrics logging
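
For readers unfamiliar with this style of logging, the sketch below shows JSON-formatted log lines carrying a correlation ID, using only the standard library. It illustrates the idea and is not the service's own logging setup; `JsonFormatter`, `build_logger`, and the field names are assumptions.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via the `extra` argument; None when not provided
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str = "document_service") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
# One ID per request lets related log lines be grouped later
logger.info("document processed", extra={"correlation_id": str(uuid.uuid4())})
```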

Metrics (Prometheus)

# Access metrics endpoint
curl http://localhost:8000/metrics

Health Monitoring

# Health check (always accessible)
curl http://localhost:8000/health
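
In deployment scripts it is often useful to wait for the service to come up before sending documents. The sketch below polls the health endpoint with the standard library. It assumes /health answers with HTTP 200 when the service is ready, which this README does not state explicitly; both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed contract)."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 10,
                       delay: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Typical usage would be `wait_until_healthy(lambda: is_healthy("http://localhost:8000"))` before the first upload.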

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature/your-feature
Make changes following clean architecture principles
Add tests for new functionality
Ensure the new feature works correctly
Update documentation as needed
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Troubleshooting

Common Issues

Authentication Errors (401)

Verify DISABLE_AUTH=false is set correctly
Check that AUTH_API_KEY is set in environment
Ensure client is sending X-API-Key header
Verify API key matches exactly (no extra spaces)

Key Generation Issues

Linux/macOS: Ensure openssl is installed or /dev/urandom exists
Windows: Run PowerShell as Administrator if needed
Check script permissions: chmod +x generate_keys.sh

File Upload Errors

Check file size limits (UPLOAD_MAX_FILE_SIZE)
Verify file format is supported (PDF, DOCX, DOC, TXT)
Ensure UPLOADS_DIR is writable
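
The first two checks can also be run on the client before uploading. This sketch uses the supported formats and the default size limit listed in this README; `validate_upload` is a hypothetical helper, and the server remains the final authority on what it accepts.

```python
import os

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}
DEFAULT_MAX_FILE_SIZE = 104857600  # 100 MB, the UPLOAD_MAX_FILE_SIZE value shown above


def validate_upload(path: str, max_size: int = DEFAULT_MAX_FILE_SIZE) -> list[str]:
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file not found")
    elif os.path.getsize(path) > max_size:
        problems.append(f"file exceeds {max_size} bytes")
    return problems
```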

spaCy Model Issues

# Install required spaCy model
python -m spacy download en_core_web_sm

Debug Mode

DEBUG=true
LOG_LEVEL=DEBUG

This will provide detailed logging for troubleshooting issues.

🤖 For AI Coding Agents

Start here: FOR-AI-CODING-AGENTS.md

This project has comprehensive documentation for AI coding agents (Claude Code, GitHub Copilot, Cursor, Codeium, etc.):

Architecture Guide: Read FOR-AI-CODING-AGENTS.md for Clean Architecture patterns, development commands, and implementation details
Knowledge Base: Check docs/ before implementing - contains ADRs, failures, data quirks, and integration docs
Auto-Compound: Document your decisions, failures, and discoveries in docs/ for future developers

Quick commands:

# Search past decisions
grep -r "keyword" docs/decisions/

# Check known mistakes
grep -r "keyword" docs/failures/

# Review quirks/gotchas
grep -r "keyword" docs/data-quirks/

See: docs/AI-AGENT-GUIDE.md for complete compounding engineering workflow.

🔗 Related Projects

BrainDrive Main Application - The primary chat application
BrainDrive Chat With Your Documents Service - Sophisticated RAG
