Production-ready orchestration hub for a self-aware market data platform. Manages 6 microservices (schema registry, SDK, TimescaleDB store, validator, IBKR provider, orchestrator) with Docker Compose, automated version tracking, Streamlit cockpit, and Prometheus/Grafana monitoring.

mjdevaccount/market_data_infra

🏗️ Market Data Infrastructure

Self-healing, AI-powered, production-grade market data platform with autonomous deployment intelligence

Features • Quick Start • Architecture • Deployment • Documentation • Contributing


📋 Overview

market_data_infra is an enterprise-grade infrastructure orchestration hub for a self-aware, self-healing market data platform. It combines microservices architecture, GitOps principles, and AI-powered deployment intelligence to deliver production-ready market data infrastructure.

🎯 Core Capabilities

  • 🤖 AI-Powered Policy Tuning - Threshold optimization from production metrics (heuristic today, LLM-ready)
  • 🚀 Autonomous Deployments - Self-healing canary rollouts with automatic rollback
  • 📊 SLO-Based Guardrails - Service Level Objective monitoring with auto-disable/reinstate
  • 🎭 Canary Intelligence - Adaptive step-wise rollouts with Prometheus guard evaluation
  • 🧠 Self-Aware Platform - Automatic version tracking, drift detection, and health monitoring
  • 🛩️ Interactive Cockpit - Real-time Streamlit control panel with live metrics
  • 🔄 GitOps Automation - Repository dispatch integration for seamless CI/CD
  • 📈 Comprehensive Observability - Prometheus, Grafana, and custom metrics

Platform Components

The infrastructure orchestrates 12+ services across multiple layers:

Core Platform Services

| Service | Repository | Purpose | Port |
|---|---|---|---|
| Registry | schema-registry-service | Central schema store | 8080 |
| Core | market-data-core | SDK, events, schema publisher | 8081 |
| Store | market-data-store | TimescaleDB backend, drift reporter | 8082 |
| Pipeline | market_data_pipeline | Validator, enforcer, rules engine | 8083 |
| IBKR | market_data_ibkr | Market data provider adapters | 8084 |
| Orchestrator | market_data_orchestrator | Dashboards, telemetry | 8501 |

Infrastructure & Control Plane

| Service | Purpose | Port |
|---|---|---|
| PostgreSQL/TimescaleDB | Time-series database | 5433 |
| Prometheus | Metrics collection & alerting | 9090 |
| Grafana | Visualization dashboards | 3000 |
| Loki | Log aggregation | 3100 |
| Redis | Caching & pub/sub | 6379 |
| Cockpit | Platform control panel (Streamlit) | 8505 |

Deployment Intelligence Layer (Phase 13+)

| Service | Purpose | Port |
|---|---|---|
| infra_web | Deployment orchestration & webhooks | 8000 |
| ai_advisor | AI-powered policy tuning | 8086 |

✨ Features

🚀 Deployment Intelligence (Phase 13)

Phase 13.2: Runtime Deploy & Self-Update

  • Manual Deployment - Controlled service deployments via API
  • Health Verification - Post-deploy health checks with automatic rollback
  • State Management - Last-known-good version tracking
  • Prometheus Integration - Deployment metrics (attempts, success, failures)
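
The deploy, verify, and rollback cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in deploy_controller.py: the function `deploy_with_rollback`, the `apply` and `is_healthy` callables, and the in-memory `last_good` dict (standing in for state/last_good.json) are all hypothetical.

```python
from typing import Callable, Dict


def deploy_with_rollback(
    service: str,
    tag: str,
    last_good: Dict[str, str],
    apply: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
) -> str:
    """Deploy `tag`, verify health, and revert to the last-known-good tag on failure.

    Returns "success", "rolled_back", or "failed" (nothing to roll back to).
    """
    apply(service, tag)
    if is_healthy(service):
        last_good[service] = tag  # record the new last-known-good version
        return "success"

    previous = last_good.get(service)
    if previous is None:
        return "failed"
    apply(service, previous)  # health check failed: restore the previous tag
    return "rolled_back"


# Fake callables stand in for Docker Compose and the health probe (hypothetical).
applied = []
state = {"cockpit": "v1.0.0"}
result = deploy_with_rollback(
    "cockpit", "v1.1.0", state,
    apply=lambda svc, tag: applied.append((svc, tag)),
    is_healthy=lambda svc: False,  # simulate a failed post-deploy check
)
```

Passing the deploy and health-check steps in as callables keeps the decision logic testable without a running stack.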

Phase 13.3: Auto-Deploy (Webhook + Policy Enforcement)

  • GitHub Webhook Integration - Automatic deployment on release
  • Policy-Driven Decisions - auto_deploy: true/false per service
  • Threaded Queue - Serialized deployment processing
  • HMAC Signature Validation - Secure webhook authentication
  • Automatic Rollback - Health-guard triggered rollbacks on failure
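
GitHub signs each webhook delivery with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. A minimal sketch of that check is shown here; the function name `verify_github_signature` is hypothetical and is not taken from webhook_controller.py, and the secret would come from `GITHUB_WEBHOOK_SECRET`.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check a GitHub `X-Hub-Signature-256` header against the raw request body."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how much matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


# Example: build a valid signature for a sample payload, then verify it.
body = b'{"action":"published"}'
good = "sha256=" + hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
is_valid = verify_github_signature("s3cret", body, good)
```

The signature must be computed over the raw request bytes, before any JSON parsing, or the digest will not match.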

Phase 13.4: Autonomous Scheduler

  • Background Registry Scanner - Continuous version monitoring
  • Rate Limiting - Configurable deploy attempts per hour
  • Quiet Hours - Maintenance window awareness
  • Per-Service Policies - Independent auto-deploy configuration
  • Live Control API - Pause/resume/force-run capabilities

Phase 13.5: Canary Rollouts

  • Step-Wise Deployment - Progressive rollout (10% → 50% → 100%)
  • Guard Evaluation - PromQL-based health checks between steps
  • Automatic Rollback - Guard failure triggers instant revert
  • Cockpit Integration - Real-time canary monitoring dashboard
  • Prometheus Metrics - Rollout duration, step tracking, guard failures
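
The step-wise rollout above amounts to a loop that shifts traffic, waits, evaluates guards, and reverts on the first failure. The sketch below illustrates that control flow only. `run_canary` and its `set_traffic`, `guards_pass`, and `wait` callables are hypothetical, and how canary_engine.py actually routes a percentage of traffic is not shown in this README.

```python
from typing import Callable, List, Tuple

Step = Tuple[float, int]  # (traffic fraction, wait in seconds)


def run_canary(
    steps: List[Step],
    set_traffic: Callable[[float], None],
    guards_pass: Callable[[float], bool],
    wait: Callable[[int], None],
) -> str:
    """Advance through the steps; revert to 0% as soon as a guard check fails."""
    for pct, wait_seconds in steps:
        set_traffic(pct)
        wait(wait_seconds)    # let metrics accumulate before evaluating
        if not guards_pass(pct):
            set_traffic(0.0)  # roll back: send no traffic to the new version
            return f"rolled_back_at_{int(pct * 100)}"
    return "promoted"


history = []
outcome = run_canary(
    steps=[(0.1, 60), (0.5, 120), (1.0, 180)],
    set_traffic=history.append,
    guards_pass=lambda pct: pct < 0.5,  # simulate a guard failure at 50%
    wait=lambda seconds: None,          # skip real sleeping in this example
)
```

Here the rollout stops at the 50% step and never reaches 100%, which mirrors the "guard failure triggers instant revert" behavior.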

Phase 13.6: Adaptive Canary Intelligence

  • Precheck Guards - Cluster health validation before deployment
  • Step-Specific Guards - Progressive threshold tightening
  • Auto-Disable on Failure - Service quarantine after 3 consecutive failures
  • Failure Tracking - Per-service failure counter
  • Insights Dashboard - Historical analysis and trend visualization

Phase 13.7: SLO-Based Rollback + Auto-Reinstate

  • SLO Definitions - Availability, latency, error rate targets
  • Continuous Monitoring - Every 60s SLO evaluation
  • Auto-Disable on Breach - 3 violations → service disabled
  • Auto-Reinstate on Recovery - 3 consecutive passes → re-enabled
  • SLO Dashboard - Real-time compliance tracking
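
The disable and reinstate rules above behave like a small two-state machine. The following sketch assumes that violations, like passes, are counted consecutively; the README only says "consecutive" for passes. The `SloGate` class is hypothetical, and its field names are borrowed from the `slo_policy` keys in configs/deploy_policies.yaml.

```python
from dataclasses import dataclass


@dataclass
class SloGate:
    max_violations: int = 3
    reinstate_after_passes: int = 3
    enabled: bool = True
    violations: int = 0
    passes: int = 0

    def record(self, slo_ok: bool) -> bool:
        """Feed one evaluation result; return whether auto-deploy is enabled."""
        if slo_ok:
            self.passes += 1
            self.violations = 0  # assumption: a pass resets the violation streak
            if not self.enabled and self.passes >= self.reinstate_after_passes:
                self.enabled = True
        else:
            self.violations += 1
            self.passes = 0
            if self.enabled and self.violations >= self.max_violations:
                self.enabled = False
        return self.enabled


# Three failed evaluations disable the service; three passes reinstate it.
gate = SloGate()
states = [gate.record(ok) for ok in [False, False, False, True, True, True]]
```

With one evaluation every 60 seconds, this sequence would take a service from healthy to disabled and back in about six minutes.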

Phase 13.8: AI Advisor Service

  • Heuristic Analysis - Rule-based threshold optimization
  • LLM-Ready Architecture - OpenAI/Ollama integration framework
  • Prometheus-Driven - 30-minute metric window analysis
  • Policy Suggestions - Latency & error rate threshold tuning
  • Confidence Scoring - 0-1 confidence per suggestion
  • Human-in-the-Loop - Manual approval before applying changes

🎯 Orchestration & Deployment

  • Profile-Based Management - Start services individually or as complete stack
  • Health Check Dependencies - Services wait for dependencies before starting
  • Network Isolation - Dedicated Docker network (mdnet) for service communication
  • Volume Management - Persistent data storage for PostgreSQL
  • Environment Configuration - Centralized .env for all services
  • Docker Compose Profiles - infra, core, store, pipeline, orchestrator, cockpit

🤖 Automation & Control Plane

  • Automatic Version Tracking - Downstream releases update central registry automatically
  • Pinned Compose Generation - Reproducible deployments with frozen image tags
  • Repository Dispatch Integration - Services notify infrastructure on release
  • Workflow Automation - GitHub Actions for all automation tasks
  • Policy Versioning - Timestamped backups with diff viewer
  • Config Hot-Reload - Zero-downtime policy updates
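
A diff view between a policy backup and the live policy file can be produced with Python's standard `difflib`. This is only a sketch of the idea: the helper `policy_diff`, the backup file name, and the sample YAML contents are made up for illustration, and the Cockpit's actual diff viewer may work differently.

```python
import difflib


def policy_diff(old_text: str, new_text: str, old_name: str, new_name: str) -> str:
    """Return a unified diff between two versions of a policy file."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


# Hypothetical contents of a timestamped backup and the current policy.
backup = "threshold: 600\nauto_deploy: true\n"
current = "threshold: 520.5\nauto_deploy: true\n"
diff = policy_diff(
    backup, current,
    "backups/deploy_policies.bak.yaml",  # hypothetical backup name
    "deploy_policies.yaml",
)
```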

📊 Monitoring & Observability

  • Live Dashboard - Auto-generated status with CI/PyPI badges (hourly refresh)
  • Interactive Cockpit - 10+ Streamlit panels for platform control
    • Platform Overview
    • Streaming Health
    • Runtime Audit
    • Auto-Deploy Control
    • Policy Backups & Restore
    • Scheduler Control
    • Canary Rollouts
    • Canary Insights
    • SLO Status
    • AI Suggestions
  • Prometheus Metrics - 50+ custom metrics across deployment lifecycle
  • Grafana Dashboards - Pre-configured visualization dashboards
  • Alert Rules - 20+ alert rules for proactive monitoring
  • Version Mismatch Detection - Automatic detection of registry vs PyPI drift
  • Log Aggregation - Loki integration for centralized logging

🛠️ Developer Experience

  • Makefile Commands - Simple make targets for common operations
  • Validation Scripts - Automated health checks and smoke tests
  • Comprehensive Documentation - 20+ guides for setup, usage, and troubleshooting
  • Local Development - Full stack runs locally with Docker
  • CI/CD Ready - Workflows for testing and deployment
  • Hot Reload Support - Development mode with live code updates

🚀 Quick Start

Prerequisites

  • Docker 24.0+ with Docker Compose v2
  • Git for cloning repositories
  • Make (optional, but recommended)
  • Python 3.11+ (for local scripts)
  • curl and jq (for API testing)

Environment Setup

  1. Copy the environment template:

    cp github-runner.env.example .env
  2. Configure required variables:

    # Essential
    REPO_ACCESS_TOKEN=ghp_your_github_token_here
    GITHUB_WEBHOOK_SECRET=your_webhook_secret_here
    
    # Optional (defaults provided)
    POSTGRES_PASSWORD=postgres
    GRAFANA_ADMIN_PASSWORD=admin
    OPENAI_API_KEY=sk_your_openai_key  # For AI Advisor LLM mode
  3. Generate a GitHub personal access token (PAT) and set it as REPO_ACCESS_TOKEN.

Installation

  1. Clone the infrastructure repository:

    git clone https://github.com/mjdevaccount/market_data_infra.git
    cd market_data_infra
  2. Clone sibling service repositories:

    cd ..
    git clone https://github.com/mjdevaccount/schema-registry-service.git
    git clone https://github.com/mjdevaccount/market-data-core.git
    git clone https://github.com/mjdevaccount/market-data-store.git
    git clone https://github.com/mjdevaccount/market_data_pipeline.git
    git clone https://github.com/mjdevaccount/market_data_ibkr.git
    git clone https://github.com/mjdevaccount/market_data_orchestrator.git
    git clone https://github.com/mjdevaccount/market_data_cockpit.git
    cd market_data_infra

Expected directory structure:

parent_directory/
├── market_data_infra/          # ⭐ This repo
├── schema-registry-service/
├── market-data-core/
├── market-data-store/
├── market_data_pipeline/
├── market_data_ibkr/
├── market_data_orchestrator/
└── market_data_cockpit/

Running the Platform

Full Stack (Recommended)

# Start everything
make up

# Or with Docker Compose directly
docker compose --profile infra --profile core --profile store \
  --profile pipeline --profile orchestrator --profile cockpit up -d

Selective Startup

# Infrastructure only (database, monitoring, control plane)
make up-infra

# Core platform services
docker compose --profile core --profile store up -d

# Processing layer
docker compose --profile pipeline up -d

# Control panels
make up-cockpit

Verification

# Check service health
docker ps

# Test deployment API
curl http://localhost:8000/health

# Test AI Advisor
curl http://localhost:8086/health

# Access dashboards (`open` is macOS-only; use `xdg-open` on Linux)
open http://localhost:8505  # Cockpit UI
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:8000/metrics  # Deployment metrics

First Deployment Test

# Generate AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan

# View suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Trigger a manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Check deployment status
curl http://localhost:8000/runtime/deploy/status | jq

💡 Pro Tip: Start with the Cockpit UI for a visual introduction to all platform capabilities!

Stopping the Platform

# Stop all services
make down

# Stop and remove volumes (⚠️ DATA LOSS)
make nuke

🏛️ Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                     MARKET DATA PLATFORM (Phase 13)                          │
│                                                                               │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    DEPLOYMENT INTELLIGENCE LAYER                       │  │
│  │                                                                         │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌────────────┐ │  │
│  │  │ AI Advisor  │  │ infra_web   │  │   Cockpit    │  │ Prometheus │ │  │
│  │  │             │  │             │  │              │  │            │ │  │
│  │  │ • Heuristic │◀─│ • Webhooks  │◀─│ • 10 Panels  │◀─│ • Metrics  │ │  │
│  │  │ • LLM Ready │  │ • Auto-     │  │ • Real-time  │  │ • Alerts   │ │  │
│  │  │ • Suggest   │  │   Deploy    │  │ • Control    │  │ • SLOs     │ │  │
│  │  │   Tuning    │  │ • Canary    │  │ • Insights   │  │ • Guards   │ │  │
│  │  └─────────────┘  └─────────────┘  └──────────────┘  └────────────┘ │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                    ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        APPLICATION LAYER                               │  │
│  │                                                                         │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │  │
│  │  │  PROVIDERS   │  │   PIPELINE   │  │ ORCHESTRATOR │                │  │
│  │  │              │  │              │  │              │                │  │
│  │  │ market_data  │─▶│ market_data_ │─▶│ market_data_ │                │  │
│  │  │    _ibkr     │  │   pipeline   │  │ orchestrator │                │  │
│  │  │              │  │              │  │              │                │  │
│  │  │ • IBKR feed  │  │ • Validator  │  │ • Dashboards │                │  │
│  │  │ • Synthetic  │  │ • Enforcer   │  │ • Telemetry  │                │  │
│  │  │ • Replay     │  │ • Rules      │  │ • Control    │                │  │
│  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                │  │
│  │         │                 │                 │                          │  │
│  │         └─────────┬───────┴─────────┬───────┘                          │  │
│  │                   ▼                 ▼                                   │  │
│  │         ┌──────────────────────────────────┐                           │  │
│  │         │     CORE SDK & REGISTRY          │                           │  │
│  │         │                                  │                           │  │
│  │         │  market-data-core                │                           │  │
│  │         │  schema-registry-service         │                           │  │
│  │         └──────────────┬───────────────────┘                           │  │
│  │                        │                                                │  │
│  │                        ▼                                                │  │
│  │         ┌──────────────────────────────────┐                           │  │
│  │         │      DATA STORE & DATABASE       │                           │  │
│  │         │                                  │                           │  │
│  │         │  market-data-store               │                           │  │
│  │         │  PostgreSQL/TimescaleDB          │                           │  │
│  │         └──────────────────────────────────┘                           │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                               │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                  INFRASTRUCTURE & OBSERVABILITY                        │  │
│  │                                                                         │  │
│  │  Prometheus → Grafana → Loki → Redis → Cockpit                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Deployment Intelligence Flow

┌─────────────────┐
│ GitHub Release  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Webhook Event   │ ──HMAC──▶ Signature Validation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Scheduler  │ ◀──── Continuous Registry Scan (60s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Policy Check    │ ──auto_deploy: true?──▶ Queue or Manual
└────────┬────────┘
         │ YES
         ▼
┌─────────────────┐
│ Precheck Guards │ ──Cluster Health? Alerts?──▶ Abort if unsafe
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Canary Deploy   │
│  Step 1: 10%    │ ──Wait 60s──▶ Evaluate Guards
│  Step 2: 50%    │ ──Wait 120s──▶ Evaluate Guards
│  Step 3: 100%   │ ──Wait 180s──▶ Evaluate Guards
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
   PASS      FAIL
    │         │
    ▼         ▼
┌────────┐  ┌─────────────┐
│Success │  │  Rollback   │
│        │  │  + Disable  │
└────┬───┘  └──────┬──────┘
     │             │
     ▼             ▼
┌──────────────────────┐
│ SLO Monitoring       │ ◀──── Every 60s
│ (Continuous)         │
│                      │
│ 3 breaches → Disable │
│ 3 passes → Reinstate │
└──────────────────────┘
         │
         ▼
┌──────────────────────┐
│ AI Advisor           │
│ Analyzes Metrics     │
│ Suggests Tuning      │
│                      │
│ Human Review         │
│ Apply Changes        │
└──────────────────────┘

Data Flow

1. INGESTION
   market_data_ibkr → market-data-core
   (Provider feeds data to SDK)

2. VALIDATION
   market-data-core → market_data_pipeline
   (SDK publishes to pipeline for validation)

3. ENFORCEMENT
   market_data_pipeline → schema-registry-service
   (Pipeline checks against registered schemas)

4. STORAGE
   market_data_pipeline → market-data-store → postgres
   (Valid data persisted to TimescaleDB)

5. DRIFT DETECTION
   market-data-store → schema-registry-service
   (Store reports schema drift)

6. MONITORING
   All services → prometheus → grafana → cockpit
   (Metrics collected, visualized, and controlled)

7. DEPLOYMENT
   infra_web → docker compose → health checks → prometheus guards
   (Automated deployment with health verification)

8. AI OPTIMIZATION
   ai_advisor → prometheus → suggestions → infra_web → policies
   (Metrics-driven threshold tuning, applied after human approval)

🎮 Deployment Intelligence

Canary Deployment Workflow

# 1. View current deployment policies
curl http://localhost:8000/runtime/deploy/plan | jq

# 2. Start a canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.3"}' | jq

# 3. Monitor progress
curl http://localhost:8000/runtime/deploy/canary/status | jq

# 4. Abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit

AI-Powered Policy Tuning

# 1. Scan services for optimization opportunities
curl -X POST http://localhost:8000/runtime/ai/scan | jq

# 2. View AI suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Example response:
# {
#   "suggestions": {
#     "cockpit": [
#       {
#         "kind": "threshold_adjust",
#         "path": "services.cockpit.rollout.guards[p95_latency_ms].threshold",
#         "from_value": 600.0,
#         "to_value": 520.5,
#         "rationale": "Observed median p95=473.2ms over 30m; nudging threshold toward 110% of observed.",
#         "confidence": 0.78
#       }
#     ]
#   }
# }

# 3. Apply suggestions (via Cockpit UI or API)
curl -X POST http://localhost:8000/runtime/ai/apply/cockpit

# 4. Or reject
curl -X POST http://localhost:8000/runtime/ai/reject/cockpit
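
The example response is consistent with a simple heuristic: 110% of the observed median p95 (473.2 ms × 1.10 ≈ 520.5 ms) becomes the proposed threshold. The sketch below reproduces that arithmetic. It is not the AI Advisor's actual code: `suggest_threshold`, the `min_change` cutoff, and the sample values other than 473.2 are assumptions, and the confidence score is left out because its formula is not documented here.

```python
from statistics import median
from typing import List, Optional


def suggest_threshold(
    current: float,
    p95_samples_ms: List[float],
    headroom: float = 1.10,
    min_change: float = 0.05,
) -> Optional[dict]:
    """Propose a p95 latency threshold at `headroom` times the observed median."""
    observed = median(p95_samples_ms)
    proposed = round(observed * headroom, 1)
    # Assumption: skip proposals within `min_change` (5%) of the current value.
    if abs(proposed - current) / current < min_change:
        return None
    return {
        "kind": "threshold_adjust",
        "from_value": current,
        "to_value": proposed,
        "rationale": f"Observed median p95={observed}ms; target is {headroom:.0%} of observed.",
    }


# Sample p95 readings (made up) whose median matches the response above.
suggestion = suggest_threshold(600.0, [470.1, 473.2, 480.9])
```

Using the median rather than the mean keeps a single latency spike in the 30-minute window from dragging the proposal upward.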

SLO Monitoring

# Check SLO compliance
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts'

# View SLO health metrics
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Example metrics:
# infra_slo_health_ratio{service="cockpit",slo="availability"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="latency_p95"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="error_rate"} 1.0
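
For scripted checks, the same metric lines can be parsed in a few lines of Python. This is a rough sketch that assumes the label order shown above (`service`, then `slo`); the `parse_slo_health` helper and the sample text, including the 0.5 value, are invented for illustration.

```python
import re
from typing import Dict, Tuple

LINE = re.compile(
    r'^infra_slo_health_ratio\{service="([^"]+)",slo="([^"]+)"\}\s+(\S+)$'
)


def parse_slo_health(metrics_text: str) -> Dict[Tuple[str, str], float]:
    """Map (service, slo) to the reported health ratio."""
    result = {}
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            service, slo, value = match.groups()
            result[(service, slo)] = float(value)
    return result


# Made-up sample in Prometheus text exposition format.
sample = """\
# HELP infra_slo_health_ratio SLO health
infra_slo_health_ratio{service="cockpit",slo="availability"} 1.0
infra_slo_health_ratio{service="cockpit",slo="error_rate"} 0.5
"""
health = parse_slo_health(sample)
failing = [key for key, ratio in health.items() if ratio < 1.0]
```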

Auto-Scheduler Control

# Check scheduler status
curl http://localhost:8000/runtime/deploy/auto/status | jq

# Pause auto-deployments
curl -X POST http://localhost:8000/runtime/deploy/auto/pause

# Resume
curl -X POST http://localhost:8000/runtime/deploy/auto/resume

# Force immediate scan
curl -X POST http://localhost:8000/runtime/deploy/auto/run

# Update scan interval
curl -X POST http://localhost:8000/runtime/deploy/auto/interval/120

📁 Project Structure

market_data_infra/
├── .github/
│   └── workflows/
│       ├── on_downstream_release.yml       # Release notifications
│       ├── platform_rebuild.yml            # Pinned compose generation
│       ├── dashboard_sync.yml              # Hourly dashboard refresh
│       ├── nightly_version_check.yml       # Daily PyPI validation
│       └── phase13_*.yml                   # Deployment automation tests
│
├── ai_advisor/                             # 🆕 AI-powered policy tuning
│   ├── src/
│   │   └── main.py                         # FastAPI service
│   ├── Dockerfile
│   └── requirements.txt
│
├── build/
│   └── docker-compose.platform.yml         # Auto-generated pinned compose
│
├── cockpit/                                # 🆕 Enhanced control panel
│   ├── ui/
│   │   ├── auto_deploy.py                  # Auto-deploy control
│   │   ├── policy_backups.py               # Policy backup/restore
│   │   ├── scheduler_control.py            # Scheduler management
│   │   ├── canary_rollouts.py              # Canary monitoring
│   │   ├── canary_insights.py              # Analytics dashboard
│   │   ├── slo_status.py                   # SLO compliance
│   │   └── ai_suggestions.py               # AI recommendations
│   ├── app.py                              # Main Streamlit app
│   └── Dockerfile
│
├── configs/                                # 🆕 Declarative configuration
│   ├── deploy_policies.yaml                # Deployment rules
│   ├── deploy_scheduler.yaml               # Scheduler config
│   └── backups/                            # Versioned policy backups
│
├── docker/
│   ├── compose/                            # Service-specific compose files
│   └── initdb.d/                           # PostgreSQL init scripts
│
├── docs/
│   ├── AUTOMATION_INFRASTRUCTURE.md
│   ├── PHASE_4_CONTROL_PLANE.md
│   ├── PHASE_13_2_DEPLOY_GUIDE.md          # 🆕 Manual deployment
│   ├── PHASE_13_3_AUTO_DEPLOY_GUIDE.md     # 🆕 Auto-deploy setup
│   ├── PHASE_13_4_SCHEDULER_GUIDE.md       # 🆕 Scheduler config
│   ├── PHASE_13_5_CANARY_GUIDE.md          # 🆕 Canary rollouts
│   ├── PHASE_13_6_ADAPTIVE_GUIDE.md        # 🆕 Adaptive intelligence
│   ├── PHASE_13_7_SLO_GUIDE.md             # 🆕 SLO monitoring
│   ├── PHASE_13_8_AI_ADVISOR_GUIDE.md      # 🆕 AI tuning
│   ├── PLATFORM_OVERVIEW.md
│   ├── QUICK_REFERENCE.md
│   └── TROUBLESHOOTING.md
│
├── monitoring/
│   ├── grafana/
│   │   ├── dashboards/
│   │   └── provisioning/
│   └── prometheus/
│       ├── prometheus.yml
│       ├── alerts.yml
│       └── rules/                          # 🆕 Deployment alert rules
│           ├── alerts-deploy-canary.yml
│           ├── alerts-canary-intel.yml
│           ├── alerts-slo.yml
│           └── alerts-ai-advisor.yml
│
├── registry/
│   └── versions.json                       # Version registry (auto-updated)
│
├── scripts/
│   ├── deploy_latest.sh                    # 🆕 Deployment script
│   ├── rollback.sh                         # 🆕 Rollback script
│   ├── verify_stack.sh                     # 🆕 Health verification
│   ├── auto_deploy_test.sh                 # 🆕 Webhook simulator
│   ├── simulate_registry_bump.sh           # 🆕 Registry test
│   ├── dashboard_generator.py
│   ├── registry_updater.py
│   └── health-check.sh
│
├── src/
│   └── infra_web/                          # 🆕 Deployment orchestration
│       ├── routes/
│       │   ├── deploy.py                   # Manual deploy API
│       │   ├── deploy_webhook.py           # Webhook handler
│       │   ├── deploy_policies.py          # Policy management
│       │   ├── deploy_auto.py              # Scheduler control
│       │   ├── deploy_canary.py            # Canary API
│       │   └── ai.py                       # AI Advisor API
│       ├── services/
│       │   ├── deploy_controller.py        # Deploy logic
│       │   ├── webhook_controller.py       # Webhook validation
│       │   ├── autodeploy_worker.py        # Background worker
│       │   ├── auto_scheduler.py           # Registry scanner
│       │   ├── canary_engine.py            # Canary orchestration
│       │   ├── guard_evaluator.py          # Prometheus guards
│       │   └── ai_client.py                # AI Advisor client
│       ├── state/
│       │   ├── deploy_state.py             # Deployment state
│       │   └── scheduler_state.py          # Scheduler state
│       ├── main.py                         # FastAPI app
│       ├── Dockerfile
│       └── requirements.txt
│
├── state/                                  # 🆕 Runtime state
│   └── last_good.json                      # Last-known-good versions
│
├── docker-compose.yml                      # Main orchestration
├── Makefile                                # Management commands
├── DASHBOARD.md                            # Auto-generated status
├── STACK_STATUS.md                         # Version registry
└── README.md                               # This file

🔧 Configuration

Deployment Policies

Edit configs/deploy_policies.yaml to control deployment behavior:

services:
  cockpit:
    repo: mjdevaccount/market_data_cockpit
    auto_deploy: true              # Enable auto-deployment
    verify_url: http://cockpit:8505/health
    verify_wait_seconds: 30
    
    # SLO definitions
    slo:
      availability:
        query: 'sum(rate(...)) / sum(rate(...))'
        operator: '>='
        threshold: 0.995           # 99.5% uptime
      latency_p95:
        query: '1000 * histogram_quantile(...)'
        operator: '<='
        threshold: 600             # 600ms max
    
    slo_policy:
      max_violations: 3            # Disable after 3 SLO failures
      reinstate_after_passes: 3    # Re-enable after 3 successes
    
    # Canary rollout strategy
    rollout:
      strategy: canary
      steps:
        - pct: 0.1               # Deploy to 10%
          wait_seconds: 60
          guards:
            - name: error_rate
              threshold: 0.02    # Loose threshold initially
        
        - pct: 0.5               # Deploy to 50%
          wait_seconds: 120
          guards:
            - name: error_rate
              threshold: 0.015   # Tighten threshold
            - name: p95_latency_ms
              threshold: 600
        
        - pct: 1.0               # Full deployment
          wait_seconds: 180
          guards:
            - name: error_rate
              threshold: 0.01    # Strictest threshold
      
      precheck_guards:
        - name: cluster_health
          query: 'up == 1'
          operator: '>='
          threshold: 0.95        # 95% of targets must be up
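
Each guard and SLO in the policy pairs an observed value with an `operator` and a `threshold`. A minimal sketch of that comparison follows. `guard_passes` is a hypothetical helper, and defaulting to `<=` when a guard omits `operator` (as the step guards above do) is an assumption rather than documented behavior.

```python
import operator
from typing import Callable, Dict

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def guard_passes(guard: dict, observed: float) -> bool:
    """Return True when `observed <operator> threshold` holds for the guard."""
    # Assumption: guards without an explicit operator are upper bounds.
    compare = OPERATORS[guard.get("operator", "<=")]
    return compare(observed, guard["threshold"])


availability = {"name": "availability", "operator": ">=", "threshold": 0.995}
error_rate = {"name": "error_rate", "threshold": 0.02}  # no operator in the policy

availability_ok = guard_passes(availability, 0.999)
error_rate_ok = guard_passes(error_rate, 0.05)
```

In practice the observed value would come from running the guard's PromQL `query` against Prometheus; that call is omitted here.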

Scheduler Configuration

Edit configs/deploy_scheduler.yaml:

interval_seconds: 60             # Scan interval
initial_delay_seconds: 5
max_per_hour: 6                  # Rate limit per service

quiet_hours:
  start_hour_utc: 2              # 02:00 UTC
  end_hour_utc: 5                # 05:00 UTC

registry_file: "/app/registry/versions.json"
resolve_strategy: "service"      # "service" | "repo_short"
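
The two gates in this file, quiet hours and the hourly rate limit, could be evaluated along the following lines. This is an illustrative sketch, not the scheduler's code: both function names are hypothetical, the quiet window is assumed to be half-open (`start` inclusive, `end` exclusive), and `now` is expected to be a timezone-aware datetime.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque


def in_quiet_hours(now: datetime, start_hour_utc: int, end_hour_utc: int) -> bool:
    """True when `now` falls in [start, end) UTC; handles windows that wrap midnight."""
    hour = now.astimezone(timezone.utc).hour
    if start_hour_utc <= end_hour_utc:
        return start_hour_utc <= hour < end_hour_utc
    return hour >= start_hour_utc or hour < end_hour_utc


def under_rate_limit(history: Deque[datetime], now: datetime, max_per_hour: int) -> bool:
    """Drop deploy timestamps older than one hour, then compare against the cap."""
    cutoff = now - timedelta(hours=1)
    while history and history[0] <= cutoff:
        history.popleft()
    return len(history) < max_per_hour


# 03:30 UTC with six earlier deploys, one of them more than an hour old.
now = datetime(2025, 10, 25, 3, 30, tzinfo=timezone.utc)
recent = deque(now - timedelta(minutes=m) for m in (90, 50, 40, 30, 20, 10))
quiet = in_quiet_hours(now, 2, 5)
allowed = under_rate_limit(recent, now, 6)
```

With the values from the config above, 03:30 UTC falls inside the 02:00 to 05:00 quiet window, so a deploy would be deferred even though the rate limit still has room.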

Environment Variables

# Database
POSTGRES_DB=market_data
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_PORT=5433

# Registry
REGISTRY_URL=http://registry:8000
REGISTRY_TRACK=latest

# Deployment
GITHUB_WEBHOOK_SECRET=your_secret_here
WEBHOOK_ALLOWLIST=github.com,api.github.com

# AI Advisor
ADVISOR_MODE=heuristic           # "heuristic" | "llm"
OPENAI_API_KEY=sk_your_key       # For LLM mode
OLLAMA_URL=http://ollama:11434   # For local LLM

# Monitoring
PROMETHEUS_URL=http://prometheus:9090
GRAFANA_ADMIN_PASSWORD=admin

📊 Metrics & Monitoring

Deployment Metrics

# Deployment attempts
infra_deploy_attempts_total{service="cockpit"}

# Success rate
rate(infra_deploy_success_total[1h]) / rate(infra_deploy_attempts_total[1h])

# Rollback frequency
increase(infra_deploy_rollback_total[24h])

# Canary health
infra_canary_guard_fail_total{service="cockpit",guard="error_rate"}
infra_canary_current_step{service="cockpit"}
infra_canary_duration_seconds_bucket

# SLO compliance
infra_slo_health_ratio{service="cockpit",slo="availability"}
infra_slo_violation_total{service="cockpit"}

# Auto-deploy queue
infra_autodeploy_queue_depth
infra_autodeploy_success_total
infra_autodeploy_rollback_total

# AI Advisor
infra_ai_suggestions_total{service="cockpit"}
infra_ai_applied_total{service="cockpit",kind="threshold_adjust"}
infra_ai_rejected_total{service="cockpit"}

# Scheduler
infra_autosched_runs_total
infra_autosched_paused
infra_autosched_last_run_timestamp

Grafana Dashboards

Pre-configured dashboards available at http://localhost:3000:

  • Platform Overview - Service health, versions, uptime
  • Deployment Intelligence - Canary metrics, rollout duration, success rates
  • SLO Dashboard - SLO compliance, violation trends
  • AI Advisor Metrics - Suggestion quality, application rate
  • Prometheus Alerts - Active alerts, firing history

🧪 Testing

Deployment Testing

# Test manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Test auto-deploy webhook
./scripts/auto_deploy_test.sh market_data_cockpit v1.2.3

# Simulate registry update
./scripts/simulate_registry_bump.sh /app/registry/versions.json cockpit v1.2.4

# Test canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.5"}'

# Test AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan
curl http://localhost:8000/runtime/ai/suggestions | jq

Health Verification

# Run complete health check
./scripts/verify_stack.sh

# Check deployment API
curl http://localhost:8000/health

# Check AI Advisor
curl http://localhost:8086/health

# Verify Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check metrics endpoint
curl http://localhost:8000/metrics | grep infra_

📚 Documentation

Complete Documentation Set

| Document | Description |
|---|---|
| README.md | This file - comprehensive overview |
| QUICK_REFERENCE.md | Quick command reference |
| PLATFORM_OVERVIEW.md | Architecture deep-dive |
| TROUBLESHOOTING.md | Common issues and solutions |
| Phase 13 Guides | Deployment Intelligence |
| PHASE_13_2_DEPLOY_GUIDE.md | Manual deployment setup |
| PHASE_13_3_AUTO_DEPLOY_GUIDE.md | Auto-deploy configuration |
| PHASE_13_4_SCHEDULER_GUIDE.md | Autonomous scheduler |
| PHASE_13_5_CANARY_GUIDE.md | Canary rollout strategies |
| PHASE_13_6_ADAPTIVE_GUIDE.md | Adaptive intelligence |
| PHASE_13_7_SLO_GUIDE.md | SLO-based monitoring |
| PHASE_13_8_AI_ADVISOR_GUIDE.md | AI-powered tuning |
| Control Plane | Platform Management |
| COCKPIT_QUICK_START.md | 5-minute cockpit setup |
| COCKPIT_ROLLOUT_GUIDE.md | Comprehensive deployment |
| AUTOMATION_INFRASTRUCTURE.md | Automation system details |
| Generated | Auto-Updated |
| DASHBOARD.md | Live platform status |
| STACK_STATUS.md | Version registry |

🤝 Contributing

We welcome contributions! This is an actively developed project with continuous enhancements.

Development Workflow

  1. Fork and clone:

    git clone https://github.com/YOUR_USERNAME/market_data_infra.git
    cd market_data_infra
  2. Create a feature branch:

    git checkout -b feature/phase-13.x-your-feature
  3. Make changes and test:

    # Start platform
    docker compose --profile infra up -d

    # Test your changes
    curl http://localhost:8000/your-endpoint

    # Check logs
    docker logs infra_web
  4. Run validation:

    # Health checks
    ./scripts/verify_stack.sh

    # Linting (if applicable); find avoids relying on the shell's ** globbing
    find src/infra_web -name '*.py' -print0 | xargs -0 pylint
  5. Commit with conventional commits:

    git add .
    git commit -m "feat(phase13.x): add your feature description"
  6. Push and create PR:

    git push origin feature/phase-13.x-your-feature

Contribution Areas

We're particularly interested in contributions for:

  • 🤖 LLM Integration - OpenAI/Ollama for AI Advisor
  • ☸️ Kubernetes Deployment - Helm charts, operators
  • 🔐 Security Enhancements - RBAC, secrets management
  • 📊 Additional Metrics - Custom exporters, dashboards
  • 🧪 Testing - Integration tests, chaos engineering
  • 📚 Documentation - Guides, tutorials, examples

Code Style

  • Python: Follow PEP 8, use type hints
  • YAML: 2-space indentation
  • Shell: Use shellcheck for validation
  • Commits: Conventional commits (feat:, fix:, docs:, etc.)

🐛 Troubleshooting

Deployment Issues

Problem: Canary rollout stuck in "waiting" state

Solution:

# Check guard evaluation
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts.cockpit.guard_results'

# View Prometheus metrics
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=infra_canary_current_step{service="cockpit"}'

# Force abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit
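
When scripting these checks, note that PromQL selectors contain braces, quotes, and brackets, which must be percent-encoded in the query string. The sketch below uses only the standard library. The helper names (`build_query_url`, `extract_samples`, `instant_query`) are hypothetical, the base URL is taken from the examples above, and the response shape is the instant-vector format of the Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # same address as the curl examples


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Percent-encode a PromQL expression for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_samples(payload: dict) -> list[tuple[dict, float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def instant_query(promql: str) -> list[tuple[dict, float]]:
    """Run the query against a live Prometheus (requires the stack to be up)."""
    with urlopen(build_query_url(promql), timeout=5) as response:
        return extract_samples(json.load(response))


if __name__ == "__main__":
    print(build_query_url('infra_canary_current_step{service="cockpit"}'))
```

Keeping URL construction and response parsing separate from the network call makes both easy to test without a running stack.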

Problem: SLO violations causing auto-disable

Solution:

# Check SLO status
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Review violation history (-G with --data-urlencode stops curl from treating [1h] as a URL glob)
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=increase(infra_slo_violation_total[1h])'

# Manually re-enable after fixing
# Edit configs/deploy_policies.yaml, set auto_deploy: true
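
The `grep` on `/metrics` can also be done programmatically, which helps when several services report a ratio. This is a rough sketch that parses the Prometheus text exposition format. `slo_ratios` and `below_target` are hypothetical helpers, and both the `service` label and the 0.99 target in the usage lines are assumptions, since the README does not show the metric's labels or targets.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{.*\})?"
    r"\s+(?P<value>\S+)(\s+-?\d+)?$"
)


def slo_ratios(metrics_text: str, metric: str = "infra_slo_health_ratio") -> dict[str, float]:
    """Map each sample's raw label text to its value for one metric name."""
    ratios = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name") == metric:
            ratios[match.group("labels") or ""] = float(match.group("value"))
    return ratios


def below_target(ratios: dict[str, float], target: float) -> list[str]:
    """Return the label sets whose ratio is under the given target."""
    return sorted(labels for labels, value in ratios.items() if value < target)


if __name__ == "__main__":
    # Hypothetical scrape output; real label names may differ.
    text = (
        "# TYPE infra_slo_health_ratio gauge\n"
        'infra_slo_health_ratio{service="cockpit"} 0.97\n'
        'infra_slo_health_ratio{service="store"} 1\n'
    )
    print(below_target(slo_ratios(text), target=0.99))
```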

Problem: AI Advisor not generating suggestions

Solution:

# Check AI Advisor health
curl http://localhost:8086/health

# Verify Prometheus connectivity
docker exec ai_advisor curl http://prometheus:9090/-/healthy

# Review AI Advisor logs
docker logs ai_advisor

# Test with specific service
curl -X POST http://localhost:8086/ai/evaluate \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","current_policy":{...}}'

For comprehensive troubleshooting, see TROUBLESHOOTING.md.


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • TimescaleDB - Time-series database excellence
  • Prometheus - Metrics collection and alerting
  • Grafana - Beautiful visualizations
  • Streamlit - Rapid dashboard development
  • FastAPI - Modern Python web framework
  • Docker - Containerization platform
  • GitHub Actions - CI/CD automation
  • NumPy - Scientific computing for AI Advisor

📊 Project Status

Current Phase: 13.8 - AI-Powered Deployment Intelligence
Status: ✅ PRODUCTION READY
Last Updated: October 25, 2025

Platform Metrics

  • Services: 12 (6 core + 4 infrastructure + 2 intelligence)
  • API Endpoints: 50+ across deployment lifecycle
  • Automation: 8+ GitHub Actions workflows
  • Documentation: 20+ comprehensive guides
  • Prometheus Metrics: 50+ custom metrics
  • Alert Rules: 25+ proactive monitoring rules
  • Test Coverage: Health checks, smoke tests, validation scripts
  • Deployment Strategies: Manual, auto-deploy, canary, SLO-based

Phase Timeline

  • Phase 1-3: Infrastructure & Platform Integration
  • Phase 4: Control Plane & Cockpit
  • Phase 13.2: Runtime Deploy & Self-Update
  • Phase 13.3: Auto-Deploy & Webhook Integration
  • Phase 13.4: Autonomous Scheduler
  • Phase 13.5: Canary Rollouts
  • Phase 13.6: Adaptive Canary Intelligence
  • Phase 13.7: SLO-Based Rollback & Auto-Reinstate
  • Phase 13.8: AI Advisor Service
  • 🚧 Phase 14: Cloud-Native Deployment (Kubernetes, Helm)
  • 📋 Phase 15: Multi-Region & HA

Recent Achievements (Phase 13)

October 2025 - Deployment Intelligence Revolution:

  1. Runtime Deploy System (13.2)

    • Manual deployment API
    • Health verification & rollback
    • Prometheus metrics integration
    • State management
  2. Auto-Deploy Engine (13.3)

    • GitHub webhook integration
    • Policy-driven automation
    • Threaded deployment queue
    • Automatic rollback on failure
  3. Autonomous Scheduler (13.4)

    • Background registry scanner
    • Rate limiting & quiet hours
    • Live control API
    • Real-time status dashboard
  4. Canary Rollouts (13.5)

    • Step-wise progressive deployment
    • PromQL guard evaluation
    • Automatic rollback on guard failure
    • Cockpit integration
  5. Adaptive Intelligence (13.6)

    • Precheck guard system
    • Step-specific threshold tuning
    • Auto-disable on repeated failures
    • Historical insights dashboard
  6. SLO Monitoring (13.7)

    • Continuous SLO evaluation
    • Auto-disable on violations
    • Auto-reinstate on recovery
    • Real-time compliance tracking
  7. AI Advisor (13.8)

    • Heuristic threshold analysis
    • LLM-ready architecture
    • Prometheus-driven insights
    • Human-in-the-loop approval
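
To illustrate the control flow behind the canary items above (step-wise progression, guard evaluation, rollback on guard failure), here is a simplified sketch. It is not the platform's implementation: `run_canary`, `RolloutResult`, and the callback signatures are hypothetical, and in a real setup `guards_pass` would evaluate PromQL guards against Prometheus.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool
    reached_percent: int  # last traffic share that passed its guards
    history: list = field(default_factory=list)


def run_canary(
    steps: Sequence[int],
    set_traffic: Callable[[int], None],
    guards_pass: Callable[[int], bool],
) -> RolloutResult:
    """Advance through traffic steps; roll back to 0% on the first guard failure."""
    result = RolloutResult(completed=False, reached_percent=0)
    for percent in steps:
        set_traffic(percent)
        result.history.append(("shift", percent))
        if not guards_pass(percent):
            set_traffic(0)  # rollback: all traffic returns to the stable version
            result.history.append(("rollback", percent))
            return result
        result.reached_percent = percent
    result.completed = True
    return result


if __name__ == "__main__":
    applied = []
    # Hypothetical guard that fails once the canary would take all traffic.
    outcome = run_canary([10, 50, 100], applied.append, lambda pct: pct < 100)
    print(outcome.completed, outcome.reached_percent, applied)
```

Passing the traffic shift and the guard check as callables keeps the loop independent of Docker, Prometheus, or any particular policy format.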

Technology Stack

Languages:

  • Python 3.11+
  • Bash
  • YAML

Frameworks:

  • FastAPI (API services)
  • Streamlit (UI dashboards)
  • Prometheus (metrics)
  • Docker Compose (orchestration)

Databases:

  • PostgreSQL 15
  • TimescaleDB 2.20
  • Redis 7.x

Monitoring:

  • Prometheus 2.x
  • Grafana 10.x
  • Loki 2.x

AI/ML:

  • NumPy (data analysis)
  • OpenAI API (future)
  • Ollama (future)


Built with ❤️ for production-grade, AI-powered market data infrastructure

🚀 Self-Healing • 🤖 AI-Optimized • 📊 Fully Observable • ☸️ Cloud-Ready

