BrainDrive Document Processing Service

BrainDrive Document Processing Service is a FastAPI-based microservice for parsing and chunking documents. It serves as the document ingestion component of BrainDrive's Chat with Documents plugin, which uses a Retrieval-Augmented Generation (RAG) pipeline.

It can also run as a standalone document processing API. The service follows Clean Architecture principles: it processes documents and returns structured chunks with metadata. Built-in authentication secures deployments while keeping local development simple.

🚀 Quick Start

1. Clone and Install

git clone https://github.com/BrainDriveAI/Document-Processing-Service.git
cd Document-Processing-Service
poetry install

2. Configure Environment

Local Development

Copy .env.local → .env:

cp .env.local .env

At minimum, .env.local should contain:

# Set to true to disable authentication (ONLY for local development)
DISABLE_AUTH=false

# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key

⚠️ AUTH_API_KEY and JWT_SECRET are automatically generated by the generate_keys scripts.

Production Deployment

Copy .env.production → .env and update values:

cp .env.production .env

3. Generate API Keys

For Linux/macOS:

chmod +x generate_keys.sh
./generate_keys.sh --both

For Windows (PowerShell):

.\generate_keys.ps1 -Both

4. Run the Service

poetry run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

5. Test the API

# Without authentication (if auth disabled)
curl -X POST "http://localhost:8000/documents/upload" \
  -F "file=@your-document.pdf"

# With API key (if auth enabled)
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: your-generated-api-key" \
  -F "file=@your-document.pdf"

πŸ” Authentication & Security

Authentication Methods

  • API Key: Simple header-based authentication
  • JWT: Token-based authentication for integration
  • Disabled: For local development only
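The three methods differ only in which header a client sends. The sketch below maps each AUTH_METHOD value to the headers used in the curl examples in this README (X-API-Key and Authorization: Bearer). The helper name build_auth_headers is hypothetical and not part of the service.

```python
from typing import Dict


def build_auth_headers(method: str, credential: str = "") -> Dict[str, str]:
    """Return the HTTP headers a client would send for a given AUTH_METHOD.

    Hypothetical helper: the header names come from the curl examples in
    this README, the function itself is not part of the service.
    """
    if method == "api_key":
        return {"X-API-Key": credential}
    if method == "jwt":
        return {"Authorization": f"Bearer {credential}"}
    if method == "disabled":
        # No credentials are sent when authentication is turned off.
        return {}
    raise ValueError(f"Unknown auth method: {method!r}")


print(build_auth_headers("api_key", "sk-your-api-key"))
print(build_auth_headers("jwt", "your-jwt-token-here"))
```

Keeping this mapping in one place means a client can switch methods by changing configuration rather than code.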

Auth Key Generation

Linux/macOS Users

# Make script executable
chmod +x generate_keys.sh

# Generate both API key and JWT secret
./generate_keys.sh --both

# Generate only API key
./generate_keys.sh --api-key

# Generate only JWT secret
./generate_keys.sh --jwt-secret

# Custom options
./generate_keys.sh --both --length 128 --env-file .env.production

Windows Users (PowerShell)

# Generate both API key and JWT secret
.\generate_keys.ps1 -Both

# Generate only API key
.\generate_keys.ps1 -ApiKey

# Generate only JWT secret
.\generate_keys.ps1 -JwtSecret

# Custom options
.\generate_keys.ps1 -Both -Length 128 -EnvFile .env.production
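If neither script can run in your environment, a key of similar shape can be produced with Python's standard library. This is only a sketch: the exact format and length the generate_keys scripts produce is not shown here, so the sk- prefix (borrowed from the examples in this README) and the byte counts are assumptions.

```python
import secrets


def generate_api_key(num_bytes: int = 32, prefix: str = "sk-") -> str:
    """Return a random API key; prefix and length are assumptions."""
    # token_hex returns two hex characters per random byte.
    return prefix + secrets.token_hex(num_bytes)


def generate_jwt_secret(num_bytes: int = 64) -> str:
    """Return a URL-safe random string suitable as an HMAC signing secret."""
    return secrets.token_urlsafe(num_bytes)


print("AUTH_API_KEY=" + generate_api_key())
print("JWT_SECRET=" + generate_jwt_secret())
```

The printed lines can be pasted into .env by hand. The secrets module draws from the operating system's cryptographic random source, which is what you want for credentials.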

Security Best Practices

  • ✅ Keep DISABLE_AUTH=false in production and whenever other services or clients call the API
  • ✅ Store API keys securely (environment variables)
  • ✅ Use different keys for development and production
  • ✅ Rotate keys periodically
  • ✅ Never commit .env files to version control

βš™οΈ Configuration

Environment Variables

  • .env.local → minimal template for local development
  • .env.production → full template for production

Example .env.local:

DISABLE_AUTH=false
AUTH_METHOD=api_key
DEBUG=true
LOG_LEVEL=DEBUG

Example .env.production:

DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-your-production-api-key
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080
UPLOADS_DIR=/app/data/uploads

Authentication (Production Required)

# Set to true to disable authentication (ONLY for local development)
DISABLE_AUTH=false

# Choose authentication method: api_key, jwt, or disabled
AUTH_METHOD=api_key

# API Key Authentication
AUTH_API_KEY=sk-your-generated-api-key-here

# JWT Authentication (alternative to API key)
JWT_SECRET=your-generated-jwt-secret-here
JWT_ALGORITHM=HS256
JWT_EXPIRE_MINUTES=60
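To illustrate how the three JWT settings fit together, here is a standard-library sketch that signs and checks an HS256 token. Which claims the service actually requires is not documented in this README, so the sub claim and both function names are assumptions. In practice a maintained JWT library is the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(data: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def _sign(secret: str, signing_input: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url_encode(digest)


def create_hs256_token(secret: str, subject: str, expire_minutes: int = 60,
                       now: Optional[int] = None) -> str:
    """Sign a token with JWT_SECRET; expire_minutes mirrors JWT_EXPIRE_MINUTES."""
    issued_at = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    # The "sub" claim is an assumption; the service may expect other claims.
    payload = {"sub": subject, "iat": issued_at, "exp": issued_at + expire_minutes * 60}
    parts = [_b64url_encode(json.dumps(p, separators=(",", ":")).encode())
             for p in (header, payload)]
    signing_input = ".".join(parts)
    return signing_input + "." + _sign(secret, signing_input)


def decode_hs256_token(token: str, secret: str, now: Optional[int] = None) -> Dict[str, Any]:
    """Check algorithm, signature and expiry, then return the payload."""
    try:
        header_b64, payload_b64, signature = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    if not hmac.compare_digest(_sign(secret, f"{header_b64}.{payload_b64}"), signature):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = int(time.time()) if now is None else now
    if payload["exp"] <= current:
        raise ValueError("token expired")
    return payload


token = create_hs256_token("your-generated-jwt-secret-here", "my-client")
print(decode_hs256_token(token, "your-generated-jwt-secret-here"))
```

The resulting token is what a client would place after Bearer in the Authorization header.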

Application Settings

# Core App Settings
DEBUG=false
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO

# File Processing
UPLOADS_DIR=data/uploads
UPLOAD_MAX_FILE_SIZE=104857600  # 100MB
UPLOAD_MAX_PART_SIZE=52428800   # 50MB

# Document Processing
SPACY_MODEL=en_core_web_sm

# Performance
MAX_CONCURRENT_PROCESSES=4
PROCESSING_TIMEOUT=300
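As a rough picture of how these variables might be consumed, the sketch below reads a few of them from the environment with the defaults listed above. It is not the contents of app/config.py, which is not shown here. The Settings class and load_settings function are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view over a subset of the variables above."""
    disable_auth: bool
    auth_method: str
    api_port: int
    upload_max_file_size: int
    processing_timeout: int


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to README defaults."""
    return Settings(
        disable_auth=_as_bool(env.get("DISABLE_AUTH", "false")),
        auth_method=env.get("AUTH_METHOD", "api_key"),
        api_port=int(env.get("API_PORT", "8000")),
        upload_max_file_size=int(env.get("UPLOAD_MAX_FILE_SIZE", "104857600")),
        processing_timeout=int(env.get("PROCESSING_TIMEOUT", "300")),
    )


print(load_settings({"DISABLE_AUTH": "true", "API_PORT": "8080"}))
```

Passing the mapping as a parameter keeps the loader easy to test without touching the real process environment.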

Configuration Examples

Local Development (.env)

# Disable auth for local development
DISABLE_AUTH=true
DEBUG=true
LOG_LEVEL=DEBUG
UPLOADS_DIR=data/uploads

Production with API Key (.env)

# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p7q8r9s0t1u2v3w4x5y6z

# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8000

# File processing
UPLOADS_DIR=/app/uploads
UPLOAD_MAX_FILE_SIZE=104857600

Docker/Cloud Deployment (.env)

# Authentication
DISABLE_AUTH=false
AUTH_METHOD=api_key
AUTH_API_KEY=sk-prod-your-secure-api-key-here

# Application
DEBUG=false
LOG_LEVEL=INFO
API_HOST=0.0.0.0
API_PORT=8080

# Cloud-specific paths
UPLOADS_DIR=/tmp/uploads
SPACY_MODEL=en_core_web_sm

# Performance tuning for cloud
MAX_CONCURRENT_PROCESSES=2
PROCESSING_TIMEOUT=180

📊 Project Structure

Document-Processing-Service/
├── app/                               # Application code
│   ├── core/                          # Domain layer (business logic)
│   │   ├── domain/                    # Business entities, value objects, exceptions
│   │   ├── ports/                     # Interfaces/contracts
│   │   └── use_cases/                 # Application business rules
│   ├── adapters/                      # External service implementations
│   │   ├── document_processor/        # Document processing adapters
│   │   ├── token_service/             # Token counting services
│   │   └── auth_service/              # Authentication implementations
│   ├── api/                           # API layer (routes, dependencies)
│   ├── infrastructure/                # Cross-cutting concerns (logging, metrics)
│   ├── config.py                      # Configuration management
│   └── main.py                        # Application entry point
│
├── docs/                              # Documentation
│   ├── OWNERS-MANUAL.md               # Complete operations guide (40KB+)
│   ├── AI-AGENT-GUIDE.md              # Guide for AI coding agents
│   ├── README.md                      # Documentation overview
│   ├── COMPOUNDING-SETUP-COMPLETE.md  # Verification checklist
│   ├── decisions/                     # Architecture Decision Records (ADRs)
│   │   ├── 000-template.md            # ADR template
│   │   └── 001-docling-for-document-processing.md
│   ├── failures/                      # Lessons learned (what NOT to do)
│   ├── data-quirks/                   # Non-obvious system behavior
│   │   ├── 001-windows-huggingface-symlinks.md
│   │   ├── 002-git-commit-co-author.md
│   │   └── 003-git-commit-message-format.md
│   ├── integrations/                  # External service documentation
│   │   └── docling.md                 # Docling integration reference
│   └── docker_deployment.md           # Docker deployment guide
│
├── tests/                             # Tests
│   ├── document-upload.py             # Manual API test client
│   ├── document-upload.http           # HTTP test file
│   └── file_samples/                  # Sample documents for testing
│
├── FOR-AI-CODING-AGENTS.md            # Architecture guide for AI agents
├── docker-compose.yml                 # Docker Compose configuration
├── docker-compose.prod.yml            # Production Docker Compose
├── Dockerfile                         # Docker image definition
├── generate_keys.sh                   # Key generation (Linux/macOS)
├── generate_keys.ps1                  # Key generation (Windows)
├── pyproject.toml                     # Poetry dependencies
├── .env.local                         # Local dev environment template
├── .env.production                    # Production environment template
└── README.md                          # This file

📚 Documentation Guide

For Developers:

For Operators:

  • Read docs/OWNERS-MANUAL.md for complete operations guide
  • Covers deployment, monitoring, troubleshooting, scaling, disaster recovery

For AI Agents:

πŸ—οΈ Key Features

1. Clean Architecture Compliance

  • Domain Layer: Pure business logic without external dependencies
  • Ports: Interfaces defining contracts
  • Adapters: Implementations of external services
  • Use Cases: Application-specific business rules
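The relationship between these layers can be sketched in a few lines. All names below (TokenCounter, WhitespaceTokenCounter, CountDocumentTokens) are hypothetical and chosen for illustration. They are not the classes found under app/core or app/adapters.

```python
from typing import List, Protocol


class TokenCounter(Protocol):
    """Port: the contract the domain depends on (hypothetical name)."""

    def count(self, text: str) -> int:
        ...


class WhitespaceTokenCounter:
    """Adapter: one concrete implementation of the port."""

    def count(self, text: str) -> int:
        return len(text.split())


class CountDocumentTokens:
    """Use case: depends only on the port, never on a concrete adapter."""

    def __init__(self, counter: TokenCounter) -> None:
        self._counter = counter

    def execute(self, chunks: List[str]) -> int:
        return sum(self._counter.count(chunk) for chunk in chunks)


use_case = CountDocumentTokens(WhitespaceTokenCounter())
print(use_case.execute(["first chunk of text", "second chunk"]))
```

Because the use case only knows the port, a different tokenizer can be swapped in without touching business logic.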

2. Document Processing Capabilities

  • Multi-format Support: PDF, DOCX, DOC, TXT
  • Advanced Chunking: Hierarchical, semantic, token-based, adaptive
  • Metadata Extraction: Rich metadata from document structure
  • spaCy Layout Integration: Sophisticated document understanding

3. Secure API Design

  • Authentication: API Key and JWT support
  • RESTful Endpoints: Clean, intuitive API design
  • Async Processing: Handle large documents efficiently
  • Error Handling: Comprehensive error reporting

4. Extensibility

  • Plugin Architecture: Easy to add new processors
  • Strategy Pattern: Swappable chunking strategies
  • Configuration-driven: Flexible processing parameters
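A minimal sketch of what swappable chunking strategies can look like follows. The two strategies here are deliberately simpler than the hierarchical, semantic, token-based, and adaptive strategies the service lists, and the registry and function names are assumptions made for this example.

```python
from typing import Callable, Dict, List

# A strategy is any callable that turns text into a list of chunks.
ChunkingStrategy = Callable[[str], List[str]]


def chunk_by_paragraph(text: str) -> List[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def chunk_by_word_count(text: str, max_words: int = 50) -> List[str]:
    """Group words into fixed-size windows of at most max_words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Hypothetical registry: new strategies are added here, callers stay unchanged.
STRATEGIES: Dict[str, ChunkingStrategy] = {
    "paragraph": chunk_by_paragraph,
    "fixed": chunk_by_word_count,
}


def chunk_document(text: str, strategy: str = "paragraph") -> List[str]:
    """Look up the named strategy and apply it."""
    try:
        chunker = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"Unknown chunking strategy: {strategy!r}") from None
    return chunker(text)


print(chunk_document("Intro paragraph.\n\nSecond paragraph."))
```

Selecting the strategy by name is what makes the behavior configuration-driven: the name can come from a request parameter or an environment variable.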

🌐 API Endpoints

Authentication Status

GET /                    # Service info (includes auth status)
GET /health             # Health check (no auth required)
GET /docs               # API documentation (no auth required)
GET /metrics            # Prometheus metrics (no auth required)

Document Processing (Authentication Required)

POST /documents/upload  # Process document and return chunks

API Usage Examples

With API Key

curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: sk-your-api-key-here" \
  -F "file=@document.pdf"

With JWT Token

curl -X POST "http://localhost:8000/documents/upload" \
  -H "Authorization: Bearer your-jwt-token-here" \
  -F "file=@document.pdf"

Response Format

{
  "document_id": "uuid-here",
  "chunks": [
    {
      "id": "chunk-1",
      "text": "Document content...",
      "metadata": {
        "page": 1,
        "position": 0,
        "token_count": 150
      }
    }
  ],
  "metadata": {
    "total_chunks": 5,
    "processing_time": 2.34,
    "document_type": "pdf"
  }
}
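A client will usually want to pull a few numbers out of this response. The sketch below parses the sample shown above and totals the per-chunk token counts. The summarize_response helper is hypothetical, and it assumes every field appears exactly as in the sample.

```python
import json
from typing import Any, Dict

SAMPLE_RESPONSE = """
{
  "document_id": "uuid-here",
  "chunks": [
    {
      "id": "chunk-1",
      "text": "Document content...",
      "metadata": {"page": 1, "position": 0, "token_count": 150}
    }
  ],
  "metadata": {"total_chunks": 5, "processing_time": 2.34, "document_type": "pdf"}
}
"""


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a few figures from an upload response (hypothetical helper)."""
    chunks = response.get("chunks", [])
    token_total = sum(c.get("metadata", {}).get("token_count", 0) for c in chunks)
    pages = sorted({c["metadata"]["page"] for c in chunks if "page" in c.get("metadata", {})})
    return {
        "document_id": response.get("document_id"),
        "returned_chunks": len(chunks),
        "token_total": token_total,
        "pages": pages,
    }


print(summarize_response(json.loads(SAMPLE_RESPONSE)))
```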

🐳 Docker Support

Build and Run

# Build image
docker build -t braindrive-document-ai .

# Run with environment variables
docker run -p 8000:8000 \
  -e DISABLE_AUTH=false \
  -e AUTH_API_KEY=your-api-key \
  braindrive-document-ai

Docker Compose

version: '3.8'
services:
  document-ai:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DISABLE_AUTH=false
      - AUTH_API_KEY=sk-your-generated-api-key
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data

🔧 Client Integration

Python Client Example

import asyncio
import os

import aiohttp


class DocumentProcessingClient:
    """Minimal async client for the /documents/upload endpoint."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key

    async def process_document(self, file_path: str) -> dict:
        headers = {'X-API-Key': self.api_key}
        async with aiohttp.ClientSession() as session:
            with open(file_path, 'rb') as f:
                form = aiohttp.FormData()
                form.add_field('file', f, filename=os.path.basename(file_path))

                async with session.post(
                    f"{self.base_url}/documents/upload",
                    data=form,
                    headers=headers
                ) as response:
                    # Raise on 4xx/5xx (for example 401 for a bad API key)
                    response.raise_for_status()
                    return await response.json()


async def main():
    client = DocumentProcessingClient(
        base_url="http://localhost:8000",
        api_key="sk-your-api-key"
    )
    result = await client.process_document("document.pdf")
    print(result)

# Usage: await is only valid inside a coroutine, so run it via asyncio
asyncio.run(main())

🚀 Deployment

Google Cloud Run

# Build and push
gcloud builds submit --tag gcr.io/your-project/braindrive-document-ai

# Deploy
gcloud run deploy braindrive-document-ai \
  --image gcr.io/your-project/braindrive-document-ai \
  --platform managed \
  --region us-central1 \
  --set-env-vars DISABLE_AUTH=false,AUTH_API_KEY=your-secure-key

AWS Lambda (with Mangum)

from mangum import Mangum
from app.main import app

handler = Mangum(app)

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: braindrive-document-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: braindrive-document-ai
  template:
    metadata:
      labels:
        app: braindrive-document-ai  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: braindrive-document-ai
        image: your-registry/braindrive-document-ai:latest
        env:
        - name: DISABLE_AUTH
          value: "false"
        - name: AUTH_API_KEY
          valueFrom:
            secretKeyRef:
              name: braindrive-secrets
              key: api-key

🧪 Testing

Run Tests

# Install dev dependencies
poetry install --with dev

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=app

# Run specific test types
poetry run pytest tests/unit/
poetry run pytest tests/integration/

Test Authentication

# Test without auth (should fail in production)
curl -X POST "http://localhost:8000/documents/upload" \
  -F "file=@test.pdf"

# Test with valid API key
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: your-api-key" \
  -F "file=@test.pdf"

# Test with invalid API key (should return 401)
curl -X POST "http://localhost:8000/documents/upload" \
  -H "X-API-Key: invalid-key" \
  -F "file=@test.pdf"

πŸ” Monitoring & Observability

Structured Logging

  • JSON-formatted logs for easy parsing
  • Request/response correlation IDs
  • Authentication audit logs
  • Performance metrics logging
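For readers unfamiliar with the pattern, JSON logs with a correlation ID can be produced with the standard logging module alone. This is a generic sketch, not the service's logging setup under app/infrastructure. The field names and the JsonFormatter class are assumptions.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical fields)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set per request via the `extra` argument; None when absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("document_processing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One ID per request lets you tie the request and response log lines together.
request_id = str(uuid.uuid4())
logger.info("upload received: %s", "document.pdf", extra={"correlation_id": request_id})
```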

Metrics (Prometheus)

# Access metrics endpoint
curl http://localhost:8000/metrics

Health Monitoring

# Health check (always accessible)
curl http://localhost:8000/health

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make changes following clean architecture principles
  4. Add tests for new functionality
  5. Ensure the new feature works correctly
  6. Update documentation as needed
  7. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Troubleshooting

Common Issues

Authentication Errors (401)

  • Verify DISABLE_AUTH=false is set correctly
  • Check that AUTH_API_KEY is set in environment
  • Ensure client is sending X-API-Key header
  • Verify API key matches exactly (no extra spaces)

Key Generation Issues

  • Linux/macOS: Ensure openssl is installed or /dev/urandom exists
  • Windows: Run PowerShell as Administrator if needed
  • Check script permissions: chmod +x generate_keys.sh

File Upload Errors

  • Check file size limits (UPLOAD_MAX_FILE_SIZE)
  • Verify file format is supported (PDF, DOCX, DOC, TXT)
  • Ensure UPLOADS_DIR is writable
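The first two checks can be run on the client before uploading, which avoids a round trip for files that are bound to be rejected. The sketch below uses the supported formats and the default UPLOAD_MAX_FILE_SIZE listed in this README. The validate_upload helper is hypothetical, and the server's own validation may differ.

```python
import os
from typing import List

# Values taken from this README; adjust if your deployment overrides them.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}
DEFAULT_MAX_FILE_SIZE = 104857600  # 100MB, the documented UPLOAD_MAX_FILE_SIZE


def validate_upload(filename: str, size_bytes: int,
                    max_size: int = DEFAULT_MAX_FILE_SIZE) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > max_size:
        problems.append(f"file is {size_bytes} bytes, limit is {max_size}")
    return problems


print(validate_upload("report.PDF", 1024))
print(validate_upload("archive.zip", 200 * 1024 * 1024))
```

For a file on disk, os.path.getsize(path) supplies the size_bytes argument.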

spaCy Model Issues

# Install required spaCy model
python -m spacy download en_core_web_sm

Debug Mode

DEBUG=true
LOG_LEVEL=DEBUG

This will provide detailed logging for troubleshooting issues.

🤖 For AI Coding Agents

Start here: FOR-AI-CODING-AGENTS.md

This project has comprehensive documentation for AI coding agents (Claude Code, GitHub Copilot, Cursor, Codeium, etc.):

  • Architecture Guide: Read FOR-AI-CODING-AGENTS.md for Clean Architecture patterns, development commands, and implementation details
  • Knowledge Base: Check docs/ before implementing - contains ADRs, failures, data quirks, and integration docs
  • Auto-Compound: Document your decisions, failures, and discoveries in docs/ for future developers

Quick commands:

# Search past decisions
grep -r "keyword" docs/decisions/

# Check known mistakes
grep -r "keyword" docs/failures/

# Review quirks/gotchas
grep -r "keyword" docs/data-quirks/

See: docs/AI-AGENT-GUIDE.md for complete compounding engineering workflow.


🔗 Related Projects
