Self-healing, AI-powered, production-grade market data platform with autonomous deployment intelligence
Features • Quick Start • Architecture • Deployment • Documentation • Contributing
market_data_infra is an enterprise-grade infrastructure orchestration hub for a self-aware, self-healing market data platform. It combines microservices architecture, GitOps principles, and AI-powered deployment intelligence to deliver production-ready market data infrastructure.
- 🤖 AI-Powered Policy Tuning - Machine learning-based threshold optimization from production metrics
- 🚀 Autonomous Deployments - Self-healing canary rollouts with automatic rollback
- 📊 SLO-Based Guardrails - Service Level Objective monitoring with auto-disable/reinstate
- 🎭 Canary Intelligence - Adaptive step-wise rollouts with Prometheus guard evaluation
- 🧠 Self-Aware Platform - Automatic version tracking, drift detection, and health monitoring
- 🛩️ Interactive Cockpit - Real-time Streamlit control panel with live metrics
- 🔄 GitOps Automation - Repository dispatch integration for seamless CI/CD
- 📈 Comprehensive Observability - Prometheus, Grafana, and custom metrics
The infrastructure orchestrates 12+ services across multiple layers:
| Service | Repository | Purpose | Port |
|---|---|---|---|
| Registry | schema-registry-service | Central schema store | 8080 |
| Core | market-data-core | SDK, events, schema publisher | 8081 |
| Store | market-data-store | TimescaleDB backend, drift reporter | 8082 |
| Pipeline | market_data_pipeline | Validator, enforcer, rules engine | 8083 |
| IBKR | market_data_ibkr | Market data provider adapters | 8084 |
| Orchestrator | market_data_orchestrator | Dashboards, telemetry | 8501 |
| Service | Purpose | Port |
|---|---|---|
| PostgreSQL/TimescaleDB | Time-series database | 5433 |
| Prometheus | Metrics collection & alerting | 9090 |
| Grafana | Visualization dashboards | 3000 |
| Loki | Log aggregation | 3100 |
| Redis | Caching & pub/sub | 6379 |
| Cockpit | Platform control panel (Streamlit) | 8505 |
| Service | Purpose | Port |
|---|---|---|
| infra_web | Deployment orchestration & webhooks | 8000 |
| ai_advisor | AI-powered policy tuning | 8086 |
- Manual Deployment - Controlled service deployments via API
- Health Verification - Post-deploy health checks with automatic rollback
- State Management - Last-known-good version tracking
- Prometheus Integration - Deployment metrics (attempts, success, failures)
- GitHub Webhook Integration - Automatic deployment on release
- Policy-Driven Decisions - `auto_deploy: true/false` per service
- Threaded Queue - Serialized deployment processing
- HMAC Signature Validation - Secure webhook authentication
- Automatic Rollback - Health-guard triggered rollbacks on failure
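HMAC validation follows GitHub's standard webhook scheme: the payload body is signed with SHA-256 and the hex digest arrives in the `X-Hub-Signature-256` header. A minimal check (standard library only; the webhook handler in `webhook_controller.py` may differ in detail):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Validate a GitHub webhook payload against its X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest gives a constant-time comparison, resisting timing attacks
    return hmac.compare_digest(expected, signature_header)
```

The secret is the `GITHUB_WEBHOOK_SECRET` configured in `.env`; any mismatch should reject the request before it reaches the deployment queue.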
- Background Registry Scanner - Continuous version monitoring
- Rate Limiting - Configurable deploy attempts per hour
- Quiet Hours - Maintenance window awareness
- Per-Service Policies - Independent auto-deploy configuration
- Live Control API - Pause/resume/force-run capabilities
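The rate-limit and quiet-hours gates compose into a simple predicate the scanner can check before queueing a deploy. A sketch using the defaults shown later in `configs/deploy_scheduler.yaml` (function name is illustrative):

```python
from datetime import datetime, timezone

def deploy_allowed(now: datetime, attempts_this_hour: int,
                   max_per_hour: int = 6,
                   quiet_start_utc: int = 2, quiet_end_utc: int = 5) -> bool:
    """Gate a deployment on the scheduler's rate limit and quiet-hours window."""
    if attempts_this_hour >= max_per_hour:
        return False  # per-service rate limit exhausted for this hour
    hour = now.astimezone(timezone.utc).hour
    if quiet_start_utc <= hour < quiet_end_utc:
        return False  # inside the maintenance (quiet-hours) window
    return True
```

Deploys blocked this way stay in the registry scan and are retried on a later pass.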
- Step-Wise Deployment - Progressive rollout (10% → 50% → 100%)
- Guard Evaluation - PromQL-based health checks between steps
- Automatic Rollback - Guard failure triggers instant revert
- Cockpit Integration - Real-time canary monitoring dashboard
- Prometheus Metrics - Rollout duration, step tracking, guard failures
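The step-wise loop reduces to: deploy a percentage, wait, evaluate each guard, and roll back on the first failure. A stripped-down sketch (the real `canary_engine.py` also sleeps between steps and emits Prometheus metrics; the injected callables here are illustrative):

```python
from typing import Callable

def run_canary(steps: list[dict],
               evaluate_guard: Callable[[str, float], bool],
               deploy_pct: Callable[[float], None]) -> str:
    """Walk the rollout steps; any failing guard aborts and signals rollback.

    Each step looks like:
      {"pct": 0.1, "wait_seconds": 60, "guards": [{"name": "error_rate", "threshold": 0.02}]}
    """
    for step in steps:
        deploy_pct(step["pct"])
        # (a real engine sleeps step["wait_seconds"] here before evaluating)
        for guard in step.get("guards", []):
            if not evaluate_guard(guard["name"], guard["threshold"]):
                return "rollback"
    return "success"
```

`evaluate_guard` would issue the guard's PromQL query and compare the result to the threshold; injecting it keeps the control flow testable without a live Prometheus.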
- Precheck Guards - Cluster health validation before deployment
- Step-Specific Guards - Progressive threshold tightening
- Auto-Disable on Failure - Service quarantine after 3 consecutive failures
- Failure Tracking - Per-service failure counter
- Insights Dashboard - Historical analysis and trend visualization
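Both precheck and step guards boil down to comparing an observed PromQL scalar against an operator/threshold pair from the policy file (e.g. `operator: '>='`, `threshold: 0.95`). A minimal comparator sketch (the actual `guard_evaluator.py` may support more operators):

```python
import operator

# Map the policy file's operator strings to Python comparisons
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def guard_passes(value: float, op: str, threshold: float) -> bool:
    """Compare an observed PromQL scalar against a guard's operator/threshold pair."""
    return OPS[op](value, threshold)
```

For example, a `cluster_health` precheck with `operator: '>='` and `threshold: 0.95` passes when at least 95% of targets report up.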
- SLO Definitions - Availability, latency, error rate targets
- Continuous Monitoring - Every 60s SLO evaluation
- Auto-Disable on Breach - 3 violations → service disabled
- Auto-Reinstate on Recovery - 3 consecutive passes → re-enabled
- SLO Dashboard - Real-time compliance tracking
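The disable/reinstate behavior is a hysteresis over consecutive results: three violations quarantine the service, three clean passes bring it back. A sketch of that state machine (class name is illustrative):

```python
class SloState:
    """Hysteresis: N consecutive violations disable a service; M consecutive passes re-enable it."""

    def __init__(self, max_violations: int = 3, reinstate_after_passes: int = 3):
        self.max_violations = max_violations
        self.reinstate_after = reinstate_after_passes
        self.violations = 0
        self.passes = 0
        self.enabled = True

    def observe(self, slo_met: bool) -> bool:
        """Feed one evaluation result (run every 60s); return current enabled state."""
        if slo_met:
            self.passes += 1
            self.violations = 0
            if not self.enabled and self.passes >= self.reinstate_after:
                self.enabled = True   # auto-reinstate on recovery
        else:
            self.violations += 1
            self.passes = 0
            if self.enabled and self.violations >= self.max_violations:
                self.enabled = False  # auto-disable on breach
        return self.enabled
```

Resetting the opposite counter on each observation is what makes the thresholds "consecutive" rather than cumulative.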
- Heuristic Analysis - Rule-based threshold optimization
- LLM-Ready Architecture - OpenAI/Ollama integration framework
- Prometheus-Driven - 30-minute metric window analysis
- Policy Suggestions - Latency & error rate threshold tuning
- Confidence Scoring - 0-1 confidence per suggestion
- Human-in-the-Loop - Manual approval before applying changes
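The sample suggestion shown later in this README ("nudging threshold toward 110% of observed") implies a heuristic like the following sketch (the real advisor also scores confidence from metric stability; this function and its fixed confidence are illustrative):

```python
def suggest_latency_threshold(observed_p95_ms: float, current_threshold_ms: float) -> dict:
    """Propose nudging a latency guard toward 110% of the observed p95."""
    to_value = round(observed_p95_ms * 1.10, 1)
    return {
        "kind": "threshold_adjust",
        "from_value": current_threshold_ms,
        "to_value": to_value,
        "rationale": (
            f"Observed median p95={observed_p95_ms}ms over 30m; "
            "nudging threshold toward 110% of observed."
        ),
        "confidence": 0.78,  # placeholder; real scoring would weigh sample size/variance
    }
```

With the README's example numbers, an observed p95 of 473.2 ms against a 600 ms guard yields a suggested threshold of 520.5 ms, which a human then approves or rejects.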
- Profile-Based Management - Start services individually or as complete stack
- Health Check Dependencies - Services wait for dependencies before starting
- Network Isolation - Dedicated Docker network (`mdnet`) for service communication
- Volume Management - Persistent data storage for PostgreSQL
- Environment Configuration - Centralized `.env` for all services
- Docker Compose Profiles - `infra`, `core`, `store`, `pipeline`, `orchestrator`, `cockpit`
- Automatic Version Tracking - Downstream releases update central registry automatically
- Pinned Compose Generation - Reproducible deployments with frozen image tags
- Repository Dispatch Integration - Services notify infrastructure on release
- Workflow Automation - GitHub Actions for all automation tasks
- Policy Versioning - Timestamped backups with diff viewer
- Config Hot-Reload - Zero-downtime policy updates
- Live Dashboard - Auto-generated status with CI/PyPI badges (hourly refresh)
- Interactive Cockpit - 10+ Streamlit panels for platform control
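The "services notify infrastructure on release" hook uses GitHub's `repository_dispatch` REST endpoint. A sketch of the request a downstream release workflow would build (the URL and body shape follow the GitHub API; the `downstream_release` event type and `client_payload` field names are assumptions about this project's convention):

```python
import json

def build_release_dispatch(repo: str, service: str, tag: str) -> tuple[str, dict, bytes]:
    """Construct a repository_dispatch request announcing a downstream release."""
    url = f"https://api.github.com/repos/{repo}/dispatches"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": "Bearer <REPO_ACCESS_TOKEN>",  # PAT with repo + workflow scopes
    }
    body = json.dumps({
        "event_type": "downstream_release",
        "client_payload": {"service": service, "tag": tag},
    }).encode()
    return url, headers, body
```

POSTing this payload triggers the `on_downstream_release.yml` workflow, which updates the central version registry.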
- Platform Overview
- Streaming Health
- Runtime Audit
- Auto-Deploy Control
- Policy Backups & Restore
- Scheduler Control
- Canary Rollouts
- Canary Insights
- SLO Status
- AI Suggestions
- Prometheus Metrics - 50+ custom metrics across deployment lifecycle
- Grafana Dashboards - Pre-configured visualization dashboards
- Alert Rules - 20+ alert rules for proactive monitoring
- Version Mismatch Detection - Automatic detection of registry vs PyPI drift
- Log Aggregation - Loki integration for centralized logging
- Makefile Commands - Simple `make` targets for common operations
- Validation Scripts - Automated health checks and smoke tests
- Comprehensive Documentation - 20+ guides for setup, usage, and troubleshooting
- Local Development - Full stack runs locally with Docker
- CI/CD Ready - Workflows for testing and deployment
- Hot Reload Support - Development mode with live code updates
- Docker 24.0+ with Docker Compose v2
- Git for cloning repositories
- Make (optional, but recommended)
- Python 3.11+ (for local scripts)
- curl and jq (for API testing)
1. Copy the environment template:

   ```bash
   cp github-runner.env.example .env
   ```

2. Configure required variables:

   ```bash
   # Essential
   REPO_ACCESS_TOKEN=ghp_your_github_token_here
   GITHUB_WEBHOOK_SECRET=your_webhook_secret_here

   # Optional (defaults provided)
   POSTGRES_PASSWORD=postgres
   GRAFANA_ADMIN_PASSWORD=admin
   OPENAI_API_KEY=sk_your_openai_key   # For AI Advisor LLM mode
   ```

3. Generate a GitHub PAT:
   - Go to GitHub Settings > Developer settings > Personal access tokens
   - Create a new token with scopes: `repo`, `workflow`, `write:packages`
   - Copy the token to `REPO_ACCESS_TOKEN` in your `.env` file
1. Clone the infrastructure repository:

   ```bash
   git clone https://github.com/mjdevaccount/market_data_infra.git
   cd market_data_infra
   ```

2. Clone sibling service repositories:

   ```bash
   cd ..
   git clone https://github.com/mjdevaccount/schema-registry-service.git
   git clone https://github.com/mjdevaccount/market-data-core.git
   git clone https://github.com/mjdevaccount/market-data-store.git
   git clone https://github.com/mjdevaccount/market_data_pipeline.git
   git clone https://github.com/mjdevaccount/market_data_ibkr.git
   git clone https://github.com/mjdevaccount/market_data_orchestrator.git
   git clone https://github.com/mjdevaccount/market_data_cockpit.git
   cd market_data_infra
   ```

Expected directory structure:

```
parent_directory/
├── market_data_infra/          # ⭐ This repo
├── schema-registry-service/
├── market-data-core/
├── market-data-store/
├── market_data_pipeline/
├── market_data_ibkr/
├── market_data_orchestrator/
└── market_data_cockpit/
```
```bash
# Start everything
make up

# Or with Docker Compose directly
docker compose --profile infra --profile core --profile store \
  --profile pipeline --profile orchestrator --profile cockpit up -d
```

```bash
# Infrastructure only (database, monitoring, control plane)
make up-infra

# Core platform services
docker compose --profile core --profile store up -d

# Processing layer
docker compose --profile pipeline up -d

# Control panels
make up-cockpit
```

```bash
# Check service health
docker ps

# Test deployment API
curl http://localhost:8000/health

# Test AI Advisor
curl http://localhost:8086/health

# Access dashboards
open http://localhost:8505          # Cockpit UI
open http://localhost:3000          # Grafana (admin/admin)
open http://localhost:9090          # Prometheus
open http://localhost:8000/metrics  # Deployment metrics
```

```bash
# Generate AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan

# View suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Trigger a manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Check deployment status
curl http://localhost:8000/runtime/deploy/status | jq
```

💡 Pro Tip: Start with the Cockpit UI for a visual introduction to all platform capabilities!
```bash
# Stop all services
make down

# Stop and remove volumes (⚠️ DATA LOSS)
make nuke
```

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                     MARKET DATA PLATFORM (Phase 13)                         │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                  DEPLOYMENT INTELLIGENCE LAYER                        │  │
│  │                                                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌────────────┐   │  │
│  │  │ AI Advisor  │  │  infra_web  │  │   Cockpit    │  │ Prometheus │   │  │
│  │  │             │  │             │  │              │  │            │   │  │
│  │  │ • Heuristic │◀─│ • Webhooks  │◀─│ • 10 Panels  │◀─│ • Metrics  │   │  │
│  │  │ • LLM Ready │  │ • Auto-     │  │ • Real-time  │  │ • Alerts   │   │  │
│  │  │ • Suggest   │  │   Deploy    │  │ • Control    │  │ • SLOs     │   │  │
│  │  │   Tuning    │  │ • Canary    │  │ • Insights   │  │ • Guards   │   │  │
│  │  └─────────────┘  └─────────────┘  └──────────────┘  └────────────┘   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                   ▼                                         │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        APPLICATION LAYER                              │  │
│  │                                                                       │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐               │  │
│  │  │  PROVIDERS   │   │   PIPELINE   │   │ ORCHESTRATOR │               │  │
│  │  │              │   │              │   │              │               │  │
│  │  │ market_data  │─▶│ market-data- │─▶│ market-data- │               │  │
│  │  │ _ibkr        │   │ pipeline     │   │ orchestrator │               │  │
│  │  │              │   │              │   │              │               │  │
│  │  │ • IBKR feed  │   │ • Validator  │   │ • Dashboards │               │  │
│  │  │ • Synthetic  │   │ • Enforcer   │   │ • Telemetry  │               │  │
│  │  │ • Replay     │   │ • Rules      │   │ • Control    │               │  │
│  │  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘               │  │
│  │         │                  │                  │                       │  │
│  │         └─────────┬────────┴─────────┬────────┘                       │  │
│  │                   ▼                  ▼                                │  │
│  │          ┌──────────────────────────────────┐                         │  │
│  │          │        CORE SDK & REGISTRY       │                         │  │
│  │          │                                  │                         │  │
│  │          │  market-data-core                │                         │  │
│  │          │  schema-registry-service         │                         │  │
│  │          └──────────────┬───────────────────┘                         │  │
│  │                         │                                             │  │
│  │                         ▼                                             │  │
│  │          ┌──────────────────────────────────┐                         │  │
│  │          │      DATA STORE & DATABASE       │                         │  │
│  │          │                                  │                         │  │
│  │          │  market-data-store               │                         │  │
│  │          │  PostgreSQL/TimescaleDB          │                         │  │
│  │          └──────────────────────────────────┘                         │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                 INFRASTRUCTURE & OBSERVABILITY                        │  │
│  │                                                                       │  │
│  │  Prometheus → Grafana → Loki → Redis → Cockpit                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```
```text
┌─────────────────┐
│ GitHub Release  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Webhook Event   │ ──HMAC──▶ Signature Validation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Scheduler  │ ◀──── Continuous Registry Scan (60s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Policy Check    │ ──auto_deploy: true?──▶ Queue or Manual
└────────┬────────┘
         │ YES
         ▼
┌─────────────────┐
│ Precheck Guards │ ──Cluster Health? Alerts?──▶ Abort if unsafe
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Canary Deploy   │
│  Step 1: 10%    │ ──Wait 60s──▶ Evaluate Guards
│  Step 2: 50%    │ ──Wait 120s──▶ Evaluate Guards
│  Step 3: 100%   │ ──Wait 180s──▶ Evaluate Guards
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
  PASS      FAIL
    │         │
    ▼         ▼
┌────────┐  ┌─────────────┐
│Success │  │  Rollback   │
│        │  │  + Disable  │
└────┬───┘  └──────┬──────┘
     │             │
     ▼             ▼
┌──────────────────────┐
│   SLO Monitoring     │ ◀──── Every 60s
│   (Continuous)       │
│                      │
│ 3 violations → Disable
│ 3 passes → Reinstate │
└──────────────────────┘
         │
         ▼
┌──────────────────────┐
│   AI Advisor         │
│   Analyzes Metrics   │
│   Suggests Tuning    │
│                      │
│   Human Review       │
│   Apply Changes      │
└──────────────────────┘
```
```text
1. INGESTION
   market_data_ibkr → market-data-core
   (Provider feeds data to SDK)

2. VALIDATION
   market-data-core → market-data-pipeline
   (SDK publishes to pipeline for validation)

3. ENFORCEMENT
   market-data-pipeline → schema-registry-service
   (Pipeline checks against registered schemas)

4. STORAGE
   market-data-pipeline → market-data-store → postgres
   (Valid data persisted to TimescaleDB)

5. DRIFT DETECTION
   market-data-store → schema-registry-service
   (Store reports schema drift)

6. MONITORING
   All services → prometheus → grafana → cockpit
   (Metrics collected, visualized, and controlled)

7. DEPLOYMENT
   infra_web → docker compose → health checks → prometheus guards
   (Automated deployment with health verification)

8. AI OPTIMIZATION
   ai_advisor → prometheus → suggestions → infra_web → policies
   (ML-driven threshold tuning)
```
```bash
# 1. View current deployment policies
curl http://localhost:8000/runtime/deploy/plan | jq

# 2. Start a canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.3"}' | jq

# 3. Monitor progress
curl http://localhost:8000/runtime/deploy/canary/status | jq

# 4. Abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit
```

```bash
# 1. Scan services for optimization opportunities
curl -X POST http://localhost:8000/runtime/ai/scan | jq

# 2. View AI suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Example response:
# {
#   "suggestions": {
#     "cockpit": [
#       {
#         "kind": "threshold_adjust",
#         "path": "services.cockpit.rollout.guards[p95_latency_ms].threshold",
#         "from_value": 600.0,
#         "to_value": 520.5,
#         "rationale": "Observed median p95=473.2ms over 30m; nudging threshold toward 110% of observed.",
#         "confidence": 0.78
#       }
#     ]
#   }
# }

# 3. Apply suggestions (via Cockpit UI or API)
curl -X POST http://localhost:8000/runtime/ai/apply/cockpit

# 4. Or reject
curl -X POST http://localhost:8000/runtime/ai/reject/cockpit
```

```bash
# Check SLO compliance
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts'

# View SLO health metrics
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Example metrics:
# infra_slo_health_ratio{service="cockpit",slo="availability"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="latency_p95"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="error_rate"} 1.0
```

```bash
# Check scheduler status
curl http://localhost:8000/runtime/deploy/auto/status | jq

# Pause auto-deployments
curl -X POST http://localhost:8000/runtime/deploy/auto/pause

# Resume
curl -X POST http://localhost:8000/runtime/deploy/auto/resume

# Force immediate scan
curl -X POST http://localhost:8000/runtime/deploy/auto/run

# Update scan interval
curl -X POST http://localhost:8000/runtime/deploy/auto/interval/120
```

```text
market_data_infra/
├── .github/
│   └── workflows/
│       ├── on_downstream_release.yml    # Release notifications
│       ├── platform_rebuild.yml         # Pinned compose generation
│       ├── dashboard_sync.yml           # Hourly dashboard refresh
│       ├── nightly_version_check.yml    # Daily PyPI validation
│       └── phase13_*.yml                # Deployment automation tests
│
├── ai_advisor/                          # 🆕 AI-powered policy tuning
│   ├── src/
│   │   └── main.py                      # FastAPI service
│   ├── Dockerfile
│   └── requirements.txt
│
├── build/
│   └── docker-compose.platform.yml      # Auto-generated pinned compose
│
├── cockpit/                             # 🆕 Enhanced control panel
│   ├── ui/
│   │   ├── auto_deploy.py               # Auto-deploy control
│   │   ├── policy_backups.py            # Policy backup/restore
│   │   ├── scheduler_control.py         # Scheduler management
│   │   ├── canary_rollouts.py           # Canary monitoring
│   │   ├── canary_insights.py           # Analytics dashboard
│   │   ├── slo_status.py                # SLO compliance
│   │   └── ai_suggestions.py            # AI recommendations
│   ├── app.py                           # Main Streamlit app
│   └── Dockerfile
│
├── configs/                             # 🆕 Declarative configuration
│   ├── deploy_policies.yaml             # Deployment rules
│   ├── deploy_scheduler.yaml            # Scheduler config
│   └── backups/                         # Versioned policy backups
│
├── docker/
│   ├── compose/                         # Service-specific compose files
│   └── initdb.d/                        # PostgreSQL init scripts
│
├── docs/
│   ├── AUTOMATION_INFRASTRUCTURE.md
│   ├── PHASE_4_CONTROL_PLANE.md
│   ├── PHASE_13_2_DEPLOY_GUIDE.md       # 🆕 Manual deployment
│   ├── PHASE_13_3_AUTO_DEPLOY_GUIDE.md  # 🆕 Auto-deploy setup
│   ├── PHASE_13_4_SCHEDULER_GUIDE.md    # 🆕 Scheduler config
│   ├── PHASE_13_5_CANARY_GUIDE.md       # 🆕 Canary rollouts
│   ├── PHASE_13_6_ADAPTIVE_GUIDE.md     # 🆕 Adaptive intelligence
│   ├── PHASE_13_7_SLO_GUIDE.md          # 🆕 SLO monitoring
│   ├── PHASE_13_8_AI_ADVISOR_GUIDE.md   # 🆕 AI tuning
│   ├── PLATFORM_OVERVIEW.md
│   ├── QUICK_REFERENCE.md
│   └── TROUBLESHOOTING.md
│
├── monitoring/
│   ├── grafana/
│   │   ├── dashboards/
│   │   └── provisioning/
│   └── prometheus/
│       ├── prometheus.yml
│       ├── alerts.yml
│       └── rules/                       # 🆕 Deployment alert rules
│           ├── alerts-deploy-canary.yml
│           ├── alerts-canary-intel.yml
│           ├── alerts-slo.yml
│           └── alerts-ai-advisor.yml
│
├── registry/
│   └── versions.json                    # Version registry (auto-updated)
│
├── scripts/
│   ├── deploy_latest.sh                 # 🆕 Deployment script
│   ├── rollback.sh                      # 🆕 Rollback script
│   ├── verify_stack.sh                  # 🆕 Health verification
│   ├── auto_deploy_test.sh              # 🆕 Webhook simulator
│   ├── simulate_registry_bump.sh        # 🆕 Registry test
│   ├── dashboard_generator.py
│   ├── registry_updater.py
│   └── health-check.sh
│
├── src/
│   └── infra_web/                       # 🆕 Deployment orchestration
│       ├── routes/
│       │   ├── deploy.py                # Manual deploy API
│       │   ├── deploy_webhook.py        # Webhook handler
│       │   ├── deploy_policies.py       # Policy management
│       │   ├── deploy_auto.py           # Scheduler control
│       │   ├── deploy_canary.py         # Canary API
│       │   └── ai.py                    # AI Advisor API
│       ├── services/
│       │   ├── deploy_controller.py     # Deploy logic
│       │   ├── webhook_controller.py    # Webhook validation
│       │   ├── autodeploy_worker.py     # Background worker
│       │   ├── auto_scheduler.py        # Registry scanner
│       │   ├── canary_engine.py         # Canary orchestration
│       │   ├── guard_evaluator.py       # Prometheus guards
│       │   └── ai_client.py             # AI Advisor client
│       ├── state/
│       │   ├── deploy_state.py          # Deployment state
│       │   └── scheduler_state.py       # Scheduler state
│       ├── main.py                      # FastAPI app
│       ├── Dockerfile
│       └── requirements.txt
│
├── state/                               # 🆕 Runtime state
│   └── last_good.json                   # Last-known-good versions
│
├── docker-compose.yml                   # Main orchestration
├── Makefile                             # Management commands
├── DASHBOARD.md                         # Auto-generated status
├── STACK_STATUS.md                      # Version registry
└── README.md                            # This file
```
Edit `configs/deploy_policies.yaml` to control deployment behavior:
```yaml
services:
  cockpit:
    repo: mjdevaccount/market_data_cockpit
    auto_deploy: true                 # Enable auto-deployment
    verify_url: http://cockpit:8505/health
    verify_wait_seconds: 30

    # SLO definitions
    slo:
      availability:
        query: 'sum(rate(...)) / sum(rate(...))'
        operator: '>='
        threshold: 0.995              # 99.5% uptime
      latency_p95:
        query: '1000 * histogram_quantile(...)'
        operator: '<='
        threshold: 600                # 600ms max

    slo_policy:
      max_violations: 3               # Disable after 3 SLO failures
      reinstate_after_passes: 3       # Re-enable after 3 successes

    # Canary rollout strategy
    rollout:
      strategy: canary
      steps:
        - pct: 0.1                    # Deploy to 10%
          wait_seconds: 60
          guards:
            - name: error_rate
              threshold: 0.02         # Loose threshold initially
        - pct: 0.5                    # Deploy to 50%
          wait_seconds: 120
          guards:
            - name: error_rate
              threshold: 0.015        # Tighten threshold
            - name: p95_latency_ms
              threshold: 600
        - pct: 1.0                    # Full deployment
          wait_seconds: 180
          guards:
            - name: error_rate
              threshold: 0.01         # Strictest threshold

    precheck_guards:
      - name: cluster_health
        query: 'up == 1'
        operator: '>='
        threshold: 0.95               # 95% of targets must be up
```

Edit `configs/deploy_scheduler.yaml`:
```yaml
interval_seconds: 60          # Scan interval
initial_delay_seconds: 5
max_per_hour: 6               # Rate limit per service
quiet_hours:
  start_hour_utc: 2           # 02:00 UTC
  end_hour_utc: 5             # 05:00 UTC
registry_file: "/app/registry/versions.json"
resolve_strategy: "service"   # "service" | "repo_short"
```

```bash
# Database
POSTGRES_DB=market_data
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_PORT=5433

# Registry
REGISTRY_URL=http://registry:8000
REGISTRY_TRACK=latest

# Deployment
GITHUB_WEBHOOK_SECRET=your_secret_here
WEBHOOK_ALLOWLIST=github.com,api.github.com

# AI Advisor
ADVISOR_MODE=heuristic           # "heuristic" | "llm"
OPENAI_API_KEY=sk_your_key       # For LLM mode
OLLAMA_URL=http://ollama:11434   # For local LLM

# Monitoring
PROMETHEUS_URL=http://prometheus:9090
GRAFANA_ADMIN_PASSWORD=admin
```

```text
# Deployment attempts
infra_deploy_attempts_total{service="cockpit"}

# Success rate
rate(infra_deploy_success_total[1h]) / rate(infra_deploy_attempts_total[1h])

# Rollback frequency
increase(infra_deploy_rollback_total[24h])

# Canary health
infra_canary_guard_fail_total{service="cockpit",guard="error_rate"}
infra_canary_current_step{service="cockpit"}
infra_canary_duration_seconds_bucket

# SLO compliance
infra_slo_health_ratio{service="cockpit",slo="availability"}
infra_slo_violation_total{service="cockpit"}

# Auto-deploy queue
infra_autodeploy_queue_depth
infra_autodeploy_success_total
infra_autodeploy_rollback_total

# AI Advisor
infra_ai_suggestions_total{service="cockpit"}
infra_ai_applied_total{service="cockpit",kind="threshold_adjust"}
infra_ai_rejected_total{service="cockpit"}

# Scheduler
infra_autosched_runs_total
infra_autosched_paused
infra_autosched_last_run_timestamp
```
Pre-configured dashboards are available at http://localhost:3000:
- Platform Overview - Service health, versions, uptime
- Deployment Intelligence - Canary metrics, rollout duration, success rates
- SLO Dashboard - SLO compliance, violation trends
- AI Advisor Metrics - Suggestion quality, application rate
- Prometheus Alerts - Active alerts, firing history
```bash
# Test manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Test auto-deploy webhook
./scripts/auto_deploy_test.sh market_data_cockpit v1.2.3

# Simulate registry update
./scripts/simulate_registry_bump.sh /app/registry/versions.json cockpit v1.2.4

# Test canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.5"}'

# Test AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan
curl http://localhost:8000/runtime/ai/suggestions | jq
```

```bash
# Run complete health check
./scripts/verify_stack.sh

# Check deployment API
curl http://localhost:8000/health

# Check AI Advisor
curl http://localhost:8086/health

# Verify Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check metrics endpoint
curl http://localhost:8000/metrics | grep infra_
```

| Document | Description |
|---|---|
| README.md | This file - comprehensive overview |
| QUICK_REFERENCE.md | Quick command reference |
| PLATFORM_OVERVIEW.md | Architecture deep-dive |
| TROUBLESHOOTING.md | Common issues and solutions |
| Phase 13 Guides | Deployment Intelligence |
| PHASE_13_2_DEPLOY_GUIDE.md | Manual deployment setup |
| PHASE_13_3_AUTO_DEPLOY_GUIDE.md | Auto-deploy configuration |
| PHASE_13_4_SCHEDULER_GUIDE.md | Autonomous scheduler |
| PHASE_13_5_CANARY_GUIDE.md | Canary rollout strategies |
| PHASE_13_6_ADAPTIVE_GUIDE.md | Adaptive intelligence |
| PHASE_13_7_SLO_GUIDE.md | SLO-based monitoring |
| PHASE_13_8_AI_ADVISOR_GUIDE.md | AI-powered tuning |
| Control Plane | Platform Management |
| COCKPIT_QUICK_START.md | 5-minute cockpit setup |
| COCKPIT_ROLLOUT_GUIDE.md | Comprehensive deployment |
| AUTOMATION_INFRASTRUCTURE.md | Automation system details |
| Generated | Auto-Updated |
| DASHBOARD.md | Live platform status |
| STACK_STATUS.md | Version registry |
We welcome contributions! This is an actively developed project with continuous enhancements.
1. Fork and clone:

   ```bash
   git clone https://github.com/YOUR_USERNAME/market_data_infra.git
   cd market_data_infra
   ```

2. Create a feature branch:

   ```bash
   git checkout -b feature/phase-13.x-your-feature
   ```

3. Make changes and test:

   ```bash
   # Start platform
   docker compose --profile infra up -d

   # Test your changes
   curl http://localhost:8000/your-endpoint

   # Check logs
   docker logs infra_web
   ```

4. Run validation:

   ```bash
   # Health checks
   ./scripts/verify_stack.sh

   # Linting (if applicable)
   cd src/infra_web && pylint **/*.py
   ```

5. Commit with conventional commits:

   ```bash
   git add .
   git commit -m "feat(phase13.x): add your feature description"
   ```

6. Push and create a PR:

   ```bash
   git push origin feature/phase-13.x-your-feature
   ```

We're particularly interested in contributions for:
- 🤖 LLM Integration - OpenAI/Ollama for AI Advisor
- ☸️ Kubernetes Deployment - Helm charts, operators
- 🔐 Security Enhancements - RBAC, secrets management
- 📊 Additional Metrics - Custom exporters, dashboards
- 🧪 Testing - Integration tests, chaos engineering
- 📚 Documentation - Guides, tutorials, examples
- Python: Follow PEP 8, use type hints
- YAML: 2-space indentation
- Shell: Use `shellcheck` for validation
- Commits: Conventional commits (`feat:`, `fix:`, `docs:`, etc.)
Problem: Canary rollout stuck in "waiting" state
Solution:
```bash
# Check guard evaluation
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts.cockpit.guard_results'

# View Prometheus metrics
curl 'http://localhost:9090/api/v1/query?query=infra_canary_current_step{service="cockpit"}'

# Force abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit
```

Problem: SLO violations causing auto-disable
Solution:
```bash
# Check SLO status
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Review violation history
curl 'http://localhost:9090/api/v1/query?query=increase(infra_slo_violation_total[1h])'

# Manually re-enable after fixing
# Edit configs/deploy_policies.yaml, set auto_deploy: true
```

Problem: AI Advisor not generating suggestions
Solution:
```bash
# Check AI Advisor health
curl http://localhost:8086/health

# Verify Prometheus connectivity
docker exec ai_advisor curl http://prometheus:9090/-/healthy

# Review AI Advisor logs
docker logs ai_advisor

# Test with specific service
curl -X POST http://localhost:8086/ai/evaluate \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","current_policy":{...}}'
```

For comprehensive troubleshooting, see TROUBLESHOOTING.md.
This project is licensed under the MIT License - see the LICENSE file for details.
- TimescaleDB - Time-series database excellence
- Prometheus - Metrics collection and alerting
- Grafana - Beautiful visualizations
- Streamlit - Rapid dashboard development
- FastAPI - Modern Python web framework
- Docker - Containerization platform
- GitHub Actions - CI/CD automation
- NumPy - Scientific computing for AI Advisor
Current Phase: 13.8 - AI-Powered Deployment Intelligence
Status: ✅ PRODUCTION READY
Last Updated: October 25, 2025
- Services: 12 (6 core + 4 infrastructure + 2 intelligence)
- API Endpoints: 50+ across deployment lifecycle
- Automation: 8+ GitHub Actions workflows
- Documentation: 20+ comprehensive guides
- Prometheus Metrics: 50+ custom metrics
- Alert Rules: 25+ proactive monitoring rules
- Test Coverage: Health checks, smoke tests, validation scripts
- Deployment Strategies: Manual, auto-deploy, canary, SLO-based
- ✅ Phase 1-3: Infrastructure & Platform Integration
- ✅ Phase 4: Control Plane & Cockpit
- ✅ Phase 13.2: Runtime Deploy & Self-Update
- ✅ Phase 13.3: Auto-Deploy & Webhook Integration
- ✅ Phase 13.4: Autonomous Scheduler
- ✅ Phase 13.5: Canary Rollouts
- ✅ Phase 13.6: Adaptive Canary Intelligence
- ✅ Phase 13.7: SLO-Based Rollback & Auto-Reinstate
- ✅ Phase 13.8: AI Advisor Service
- 🚧 Phase 14: Cloud-Native Deployment (Kubernetes, Helm)
- 📋 Phase 15: Multi-Region & HA
October 2025 - Deployment Intelligence Revolution:
- Runtime Deploy System (13.2)
  - Manual deployment API
  - Health verification & rollback
  - Prometheus metrics integration
  - State management
- Auto-Deploy Engine (13.3)
  - GitHub webhook integration
  - Policy-driven automation
  - Threaded deployment queue
  - Automatic rollback on failure
- Autonomous Scheduler (13.4)
  - Background registry scanner
  - Rate limiting & quiet hours
  - Live control API
  - Real-time status dashboard
- Canary Rollouts (13.5)
  - Step-wise progressive deployment
  - PromQL guard evaluation
  - Automatic rollback on guard failure
  - Cockpit integration
- Adaptive Intelligence (13.6)
  - Precheck guard system
  - Step-specific threshold tuning
  - Auto-disable on repeated failures
  - Historical insights dashboard
- SLO Monitoring (13.7)
  - Continuous SLO evaluation
  - Auto-disable on violations
  - Auto-reinstate on recovery
  - Real-time compliance tracking
- AI Advisor (13.8)
  - Heuristic threshold analysis
  - LLM-ready architecture
  - Prometheus-driven insights
  - Human-in-the-loop approval
Languages:
- Python 3.11+
- Bash
- YAML
Frameworks:
- FastAPI (API services)
- Streamlit (UI dashboards)
- Prometheus (metrics)
- Docker Compose (orchestration)
Databases:
- PostgreSQL 15
- TimescaleDB 2.20
- Redis 7.x
Monitoring:
- Prometheus 2.x
- Grafana 10.x
- Loki 2.x
AI/ML:
- NumPy (data analysis)
- OpenAI API (future)
- Ollama (future)
- Infrastructure Hub: market_data_infra ⭐
- Platform Services:
- Documentation: docs/
- API Reference: http://localhost:8000/docs (when running)
- Metrics: http://localhost:8000/metrics
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
Built with ❤️ for production-grade, AI-powered market data infrastructure
🚀 Self-Healing • 🤖 AI-Optimized • 📊 Fully Observable • ☸️ Cloud-Ready