Self-healing, AI-powered, production-grade market data platform with autonomous deployment intelligence
Features • Quick Start • Architecture • Deployment • Documentation • Contributing
market_data_infra is an enterprise-grade infrastructure orchestration hub for a self-aware, self-healing market data platform. It combines microservices architecture, GitOps principles, and AI-powered deployment intelligence to deliver production-ready market data infrastructure.
- 🤖 AI-Powered Policy Tuning - Machine learning-based threshold optimization from production metrics
- 🚀 Autonomous Deployments - Self-healing canary rollouts with automatic rollback
- 📊 SLO-Based Guardrails - Service Level Objective monitoring with auto-disable/reinstate
- 🎭 Canary Intelligence - Adaptive step-wise rollouts with Prometheus guard evaluation
- 🧠 Self-Aware Platform - Automatic version tracking, drift detection, and health monitoring
- 🛩️ Interactive Cockpit - Real-time Streamlit control panel with live metrics
- 🔄 GitOps Automation - Repository dispatch integration for seamless CI/CD
- 📈 Comprehensive Observability - Prometheus, Grafana, and custom metrics
The infrastructure orchestrates 12+ services across multiple layers:
| Service | Repository | Purpose | Port |
|---|---|---|---|
| Registry | schema-registry-service | Central schema store | 8080 |
| Core | market-data-core | SDK, events, schema publisher | 8081 |
| Store | market-data-store | TimescaleDB backend, drift reporter | 8082 |
| Pipeline | market_data_pipeline | Validator, enforcer, rules engine | 8083 |
| IBKR | market_data_ibkr | Market data provider adapters | 8084 |
| Orchestrator | market_data_orchestrator | Dashboards, telemetry | 8501 |
| Service | Purpose | Port |
|---|---|---|
| PostgreSQL/TimescaleDB | Time-series database | 5433 |
| Prometheus | Metrics collection & alerting | 9090 |
| Grafana | Visualization dashboards | 3000 |
| Loki | Log aggregation | 3100 |
| Redis | Caching & pub/sub | 6379 |
| Cockpit | Platform control panel (Streamlit) | 8505 |
| Service | Purpose | Port |
|---|---|---|
| infra_web | Deployment orchestration & webhooks | 8000 |
| ai_advisor | AI-powered policy tuning | 8086 |
- Manual Deployment - Controlled service deployments via API
- Health Verification - Post-deploy health checks with automatic rollback
- State Management - Last-known-good version tracking
- Prometheus Integration - Deployment metrics (attempts, success, failures)
- GitHub Webhook Integration - Automatic deployment on release
- Policy-Driven Decisions - `auto_deploy: true/false` per service
- Threaded Queue - Serialized deployment processing
- HMAC Signature Validation - Secure webhook authentication
- Automatic Rollback - Health-guard triggered rollbacks on failure
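HMAC validation follows GitHub's standard webhook scheme: the payload body is signed with SHA-256 and the hex digest arrives in the `X-Hub-Signature-256` header. A minimal check (standard library only; the webhook handler in `webhook_controller.py` may differ in detail):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Validate a GitHub webhook payload against its X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest gives a constant-time comparison, resisting timing attacks
    return hmac.compare_digest(expected, signature_header)
```

The secret is the `GITHUB_WEBHOOK_SECRET` configured in `.env`; any mismatch should reject the request before it reaches the deployment queue.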
- Background Registry Scanner - Continuous version monitoring
- Rate Limiting - Configurable deploy attempts per hour
- Quiet Hours - Maintenance window awareness
- Per-Service Policies - Independent auto-deploy configuration
- Live Control API - Pause/resume/force-run capabilities
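The rate-limit and quiet-hours gates compose into a simple predicate the scanner can check before queueing a deploy. A sketch using the defaults shown later in `configs/deploy_scheduler.yaml` (function name is illustrative):

```python
from datetime import datetime, timezone

def deploy_allowed(now: datetime, attempts_this_hour: int,
                   max_per_hour: int = 6,
                   quiet_start_utc: int = 2, quiet_end_utc: int = 5) -> bool:
    """Gate a deployment on the scheduler's rate limit and quiet-hours window."""
    if attempts_this_hour >= max_per_hour:
        return False  # per-service rate limit exhausted for this hour
    hour = now.astimezone(timezone.utc).hour
    if quiet_start_utc <= hour < quiet_end_utc:
        return False  # inside the maintenance (quiet-hours) window
    return True
```

Deploys blocked this way stay in the registry scan and are retried on a later pass.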
- Step-Wise Deployment - Progressive rollout (10% → 50% → 100%)
- Guard Evaluation - PromQL-based health checks between steps
- Automatic Rollback - Guard failure triggers instant revert
- Cockpit Integration - Real-time canary monitoring dashboard
- Prometheus Metrics - Rollout duration, step tracking, guard failures
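The step-wise loop reduces to: deploy a percentage, wait, evaluate each guard, and roll back on the first failure. A stripped-down sketch (the real `canary_engine.py` also sleeps between steps and emits Prometheus metrics; the injected callables here are illustrative):

```python
from typing import Callable

def run_canary(steps: list[dict],
               evaluate_guard: Callable[[str, float], bool],
               deploy_pct: Callable[[float], None]) -> str:
    """Walk the rollout steps; any failing guard aborts and signals rollback.

    Each step looks like:
      {"pct": 0.1, "wait_seconds": 60, "guards": [{"name": "error_rate", "threshold": 0.02}]}
    """
    for step in steps:
        deploy_pct(step["pct"])
        # (a real engine sleeps step["wait_seconds"] here before evaluating)
        for guard in step.get("guards", []):
            if not evaluate_guard(guard["name"], guard["threshold"]):
                return "rollback"
    return "success"
```

`evaluate_guard` would issue the guard's PromQL query and compare the result to the threshold; injecting it keeps the control flow testable without a live Prometheus.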
- Precheck Guards - Cluster health validation before deployment
- Step-Specific Guards - Progressive threshold tightening
- Auto-Disable on Failure - Service quarantine after 3 consecutive failures
- Failure Tracking - Per-service failure counter
- Insights Dashboard - Historical analysis and trend visualization
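Both precheck and step guards boil down to comparing an observed PromQL scalar against an operator/threshold pair from the policy file (e.g. `operator: '>='`, `threshold: 0.95`). A minimal comparator sketch (the actual `guard_evaluator.py` may support more operators):

```python
import operator

# Map the policy file's operator strings to Python comparisons
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def guard_passes(value: float, op: str, threshold: float) -> bool:
    """Compare an observed PromQL scalar against a guard's operator/threshold pair."""
    return OPS[op](value, threshold)
```

For example, a `cluster_health` precheck with `operator: '>='` and `threshold: 0.95` passes when at least 95% of targets report up.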
- SLO Definitions - Availability, latency, error rate targets
- Continuous Monitoring - Every 60s SLO evaluation
- Auto-Disable on Breach - 3 violations → service disabled
- Auto-Reinstate on Recovery - 3 consecutive passes → re-enabled
- SLO Dashboard - Real-time compliance tracking
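The disable/reinstate behavior is a hysteresis over consecutive results: three violations quarantine the service, three clean passes bring it back. A sketch of that state machine (class name is illustrative):

```python
class SloState:
    """Hysteresis: N consecutive violations disable a service; M consecutive passes re-enable it."""

    def __init__(self, max_violations: int = 3, reinstate_after_passes: int = 3):
        self.max_violations = max_violations
        self.reinstate_after = reinstate_after_passes
        self.violations = 0
        self.passes = 0
        self.enabled = True

    def observe(self, slo_met: bool) -> bool:
        """Feed one evaluation result (run every 60s); return current enabled state."""
        if slo_met:
            self.passes += 1
            self.violations = 0
            if not self.enabled and self.passes >= self.reinstate_after:
                self.enabled = True   # auto-reinstate on recovery
        else:
            self.violations += 1
            self.passes = 0
            if self.enabled and self.violations >= self.max_violations:
                self.enabled = False  # auto-disable on breach
        return self.enabled
```

Resetting the opposite counter on each observation is what makes the thresholds "consecutive" rather than cumulative.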
- Heuristic Analysis - Rule-based threshold optimization
- LLM-Ready Architecture - OpenAI/Ollama integration framework
- Prometheus-Driven - 30-minute metric window analysis
- Policy Suggestions - Latency & error rate threshold tuning
- Confidence Scoring - 0-1 confidence per suggestion
- Human-in-the-Loop - Manual approval before applying changes
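The sample suggestion shown later in this README ("nudging threshold toward 110% of observed") implies a heuristic like the following sketch (the real advisor also scores confidence from metric stability; this function and its fixed confidence are illustrative):

```python
def suggest_latency_threshold(observed_p95_ms: float, current_threshold_ms: float) -> dict:
    """Propose nudging a latency guard toward 110% of the observed p95."""
    to_value = round(observed_p95_ms * 1.10, 1)
    return {
        "kind": "threshold_adjust",
        "from_value": current_threshold_ms,
        "to_value": to_value,
        "rationale": (
            f"Observed median p95={observed_p95_ms}ms over 30m; "
            "nudging threshold toward 110% of observed."
        ),
        "confidence": 0.78,  # placeholder; real scoring would weigh sample size/variance
    }
```

With the README's example numbers, an observed p95 of 473.2 ms against a 600 ms guard yields a suggested threshold of 520.5 ms, which a human then approves or rejects.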
- Profile-Based Management - Start services individually or as complete stack
- Health Check Dependencies - Services wait for dependencies before starting
- Network Isolation - Dedicated Docker network (`mdnet`) for service communication
- Volume Management - Persistent data storage for PostgreSQL
- Environment Configuration - Centralized `.env` for all services
- Docker Compose Profiles - `infra`, `core`, `store`, `pipeline`, `orchestrator`, `cockpit`
- Automatic Version Tracking - Downstream releases update central registry automatically
- Pinned Compose Generation - Reproducible deployments with frozen image tags
- Repository Dispatch Integration - Services notify infrastructure on release
- Workflow Automation - GitHub Actions for all automation tasks
- Policy Versioning - Timestamped backups with diff viewer
- Config Hot-Reload - Zero-downtime policy updates
- Live Dashboard - Auto-generated status with CI/PyPI badges (hourly refresh)
- Interactive Cockpit - 10+ Streamlit panels for platform control
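The "services notify infrastructure on release" hook uses GitHub's `repository_dispatch` REST endpoint. A sketch of the request a downstream release workflow would build (the URL and body shape follow the GitHub API; the `downstream_release` event type and `client_payload` field names are assumptions about this project's convention):

```python
import json

def build_release_dispatch(repo: str, service: str, tag: str) -> tuple[str, dict, bytes]:
    """Construct a repository_dispatch request announcing a downstream release."""
    url = f"https://api.github.com/repos/{repo}/dispatches"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": "Bearer <REPO_ACCESS_TOKEN>",  # PAT with repo + workflow scopes
    }
    body = json.dumps({
        "event_type": "downstream_release",
        "client_payload": {"service": service, "tag": tag},
    }).encode()
    return url, headers, body
```

POSTing this payload triggers the `on_downstream_release.yml` workflow, which updates the central version registry.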
- Platform Overview
- Streaming Health
- Runtime Audit
- Auto-Deploy Control
- Policy Backups & Restore
- Scheduler Control
- Canary Rollouts
- Canary Insights
- SLO Status
- AI Suggestions
- Prometheus Metrics - 50+ custom metrics across deployment lifecycle
- Grafana Dashboards - Pre-configured visualization dashboards
- Alert Rules - 20+ alert rules for proactive monitoring
- Version Mismatch Detection - Automatic detection of registry vs PyPI drift
- Log Aggregation - Loki integration for centralized logging
- Makefile Commands - Simple `make` targets for common operations
- Validation Scripts - Automated health checks and smoke tests
- Comprehensive Documentation - 20+ guides for setup, usage, and troubleshooting
- Local Development - Full stack runs locally with Docker
- CI/CD Ready - Workflows for testing and deployment
- Hot Reload Support - Development mode with live code updates
- Docker 24.0+ with Docker Compose v2
- Git for cloning repositories
- Make (optional, but recommended)
- Python 3.11+ (for local scripts)
- curl and jq (for API testing)
1. Copy the environment template:

   ```bash
   cp github-runner.env.example .env
   ```

2. Configure required variables:

   ```bash
   # Essential
   REPO_ACCESS_TOKEN=ghp_your_github_token_here
   GITHUB_WEBHOOK_SECRET=your_webhook_secret_here

   # Optional (defaults provided)
   POSTGRES_PASSWORD=postgres
   GRAFANA_ADMIN_PASSWORD=admin
   OPENAI_API_KEY=sk_your_openai_key   # For AI Advisor LLM mode
   ```

3. Generate a GitHub PAT:
   - Go to GitHub Settings > Developer settings > Personal access tokens
   - Create a new token with scopes: `repo`, `workflow`, `write:packages`
   - Copy the token to `REPO_ACCESS_TOKEN` in your `.env` file
1. Clone the infrastructure repository:

   ```bash
   git clone https://github.com/mjdevaccount/market_data_infra.git
   cd market_data_infra
   ```

2. Clone sibling service repositories:

   ```bash
   cd ..
   git clone https://github.com/mjdevaccount/schema-registry-service.git
   git clone https://github.com/mjdevaccount/market-data-core.git
   git clone https://github.com/mjdevaccount/market-data-store.git
   git clone https://github.com/mjdevaccount/market_data_pipeline.git
   git clone https://github.com/mjdevaccount/market_data_ibkr.git
   git clone https://github.com/mjdevaccount/market_data_orchestrator.git
   git clone https://github.com/mjdevaccount/market_data_cockpit.git
   cd market_data_infra
   ```

Expected directory structure:

```
parent_directory/
├── market_data_infra/          # ⭐ This repo
├── schema-registry-service/
├── market-data-core/
├── market-data-store/
├── market_data_pipeline/
├── market_data_ibkr/
├── market_data_orchestrator/
└── market_data_cockpit/
```
```bash
# Start everything
make up

# Or with Docker Compose directly
docker compose --profile infra --profile core --profile store \
  --profile pipeline --profile orchestrator --profile cockpit up -d
```

```bash
# Infrastructure only (database, monitoring, control plane)
make up-infra

# Core platform services
docker compose --profile core --profile store up -d

# Processing layer
docker compose --profile pipeline up -d

# Control panels
make up-cockpit
```

```bash
# Check service health
docker ps

# Test deployment API
curl http://localhost:8000/health

# Test AI Advisor
curl http://localhost:8086/health

# Access dashboards
open http://localhost:8505          # Cockpit UI
open http://localhost:3000          # Grafana (admin/admin)
open http://localhost:9090          # Prometheus
open http://localhost:8000/metrics  # Deployment metrics
```

```bash
# Generate AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan

# View suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Trigger a manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Check deployment status
curl http://localhost:8000/runtime/deploy/status | jq
```

💡 Pro Tip: Start with the Cockpit UI for a visual introduction to all platform capabilities!
```bash
# Stop all services
make down

# Stop and remove volumes (⚠️ DATA LOSS)
make nuke
```

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                     MARKET DATA PLATFORM (Phase 13)                         │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                  DEPLOYMENT INTELLIGENCE LAYER                        │  │
│  │                                                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌────────────┐   │  │
│  │  │ AI Advisor  │  │  infra_web  │  │   Cockpit    │  │ Prometheus │   │  │
│  │  │             │  │             │  │              │  │            │   │  │
│  │  │ • Heuristic │◀─│ • Webhooks  │◀─│ • 10 Panels  │◀─│ • Metrics  │   │  │
│  │  │ • LLM Ready │  │ • Auto-     │  │ • Real-time  │  │ • Alerts   │   │  │
│  │  │ • Suggest   │  │   Deploy    │  │ • Control    │  │ • SLOs     │   │  │
│  │  │   Tuning    │  │ • Canary    │  │ • Insights   │  │ • Guards   │   │  │
│  │  └─────────────┘  └─────────────┘  └──────────────┘  └────────────┘   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                   ▼                                         │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        APPLICATION LAYER                              │  │
│  │                                                                       │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐               │  │
│  │  │  PROVIDERS   │   │   PIPELINE   │   │ ORCHESTRATOR │               │  │
│  │  │              │   │              │   │              │               │  │
│  │  │ market_data  │─▶│ market-data- │─▶│ market-data- │               │  │
│  │  │ _ibkr        │   │ pipeline     │   │ orchestrator │               │  │
│  │  │              │   │              │   │              │               │  │
│  │  │ • IBKR feed  │   │ • Validator  │   │ • Dashboards │               │  │
│  │  │ • Synthetic  │   │ • Enforcer   │   │ • Telemetry  │               │  │
│  │  │ • Replay     │   │ • Rules      │   │ • Control    │               │  │
│  │  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘               │  │
│  │         │                  │                  │                       │  │
│  │         └─────────┬────────┴─────────┬────────┘                       │  │
│  │                   ▼                  ▼                                │  │
│  │          ┌──────────────────────────────────┐                         │  │
│  │          │        CORE SDK & REGISTRY       │                         │  │
│  │          │                                  │                         │  │
│  │          │  market-data-core                │                         │  │
│  │          │  schema-registry-service         │                         │  │
│  │          └──────────────┬───────────────────┘                         │  │
│  │                         │                                             │  │
│  │                         ▼                                             │  │
│  │          ┌──────────────────────────────────┐                         │  │
│  │          │      DATA STORE & DATABASE       │                         │  │
│  │          │                                  │                         │  │
│  │          │  market-data-store               │                         │  │
│  │          │  PostgreSQL/TimescaleDB          │                         │  │
│  │          └──────────────────────────────────┘                         │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                 INFRASTRUCTURE & OBSERVABILITY                        │  │
│  │                                                                       │  │
│  │  Prometheus → Grafana → Loki → Redis → Cockpit                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```
```text
┌─────────────────┐
│ GitHub Release  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Webhook Event   │ ──HMAC──▶ Signature Validation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Scheduler  │ ◀──── Continuous Registry Scan (60s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Policy Check    │ ──auto_deploy: true?──▶ Queue or Manual
└────────┬────────┘
         │ YES
         ▼
┌─────────────────┐
│ Precheck Guards │ ──Cluster Health? Alerts?──▶ Abort if unsafe
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Canary Deploy   │
│  Step 1: 10%    │ ──Wait 60s──▶ Evaluate Guards
│  Step 2: 50%    │ ──Wait 120s──▶ Evaluate Guards
│  Step 3: 100%   │ ──Wait 180s──▶ Evaluate Guards
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
  PASS      FAIL
    │         │
    ▼         ▼
┌────────┐  ┌─────────────┐
│Success │  │  Rollback   │
│        │  │  + Disable  │
└────┬───┘  └──────┬──────┘
     │             │
     ▼             ▼
┌──────────────────────┐
│   SLO Monitoring     │ ◀──── Every 60s
│   (Continuous)       │
│                      │
│ 3 violations → Disable
│ 3 passes → Reinstate │
└──────────────────────┘
         │
         ▼
┌──────────────────────┐
│   AI Advisor         │
│   Analyzes Metrics   │
│   Suggests Tuning    │
│                      │
│   Human Review       │
│   Apply Changes      │
└──────────────────────┘
```
```text
1. INGESTION
   market_data_ibkr → market-data-core
   (Provider feeds data to SDK)

2. VALIDATION
   market-data-core → market-data-pipeline
   (SDK publishes to pipeline for validation)

3. ENFORCEMENT
   market-data-pipeline → schema-registry-service
   (Pipeline checks against registered schemas)

4. STORAGE
   market-data-pipeline → market-data-store → postgres
   (Valid data persisted to TimescaleDB)

5. DRIFT DETECTION
   market-data-store → schema-registry-service
   (Store reports schema drift)

6. MONITORING
   All services → prometheus → grafana → cockpit
   (Metrics collected, visualized, and controlled)

7. DEPLOYMENT
   infra_web → docker compose → health checks → prometheus guards
   (Automated deployment with health verification)

8. AI OPTIMIZATION
   ai_advisor → prometheus → suggestions → infra_web → policies
   (ML-driven threshold tuning)
```
```bash
# 1. View current deployment policies
curl http://localhost:8000/runtime/deploy/plan | jq

# 2. Start a canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.3"}' | jq

# 3. Monitor progress
curl http://localhost:8000/runtime/deploy/canary/status | jq

# 4. Abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit
```

```bash
# 1. Scan services for optimization opportunities
curl -X POST http://localhost:8000/runtime/ai/scan | jq

# 2. View AI suggestions
curl http://localhost:8000/runtime/ai/suggestions | jq

# Example response:
# {
#   "suggestions": {
#     "cockpit": [
#       {
#         "kind": "threshold_adjust",
#         "path": "services.cockpit.rollout.guards[p95_latency_ms].threshold",
#         "from_value": 600.0,
#         "to_value": 520.5,
#         "rationale": "Observed median p95=473.2ms over 30m; nudging threshold toward 110% of observed.",
#         "confidence": 0.78
#       }
#     ]
#   }
# }

# 3. Apply suggestions (via Cockpit UI or API)
curl -X POST http://localhost:8000/runtime/ai/apply/cockpit

# 4. Or reject
curl -X POST http://localhost:8000/runtime/ai/reject/cockpit
```

```bash
# Check SLO compliance
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts'

# View SLO health metrics
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Example metrics:
# infra_slo_health_ratio{service="cockpit",slo="availability"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="latency_p95"} 1.0
# infra_slo_health_ratio{service="cockpit",slo="error_rate"} 1.0
```

```bash
# Check scheduler status
curl http://localhost:8000/runtime/deploy/auto/status | jq

# Pause auto-deployments
curl -X POST http://localhost:8000/runtime/deploy/auto/pause

# Resume
curl -X POST http://localhost:8000/runtime/deploy/auto/resume

# Force immediate scan
curl -X POST http://localhost:8000/runtime/deploy/auto/run

# Update scan interval
curl -X POST http://localhost:8000/runtime/deploy/auto/interval/120
```

```text
market_data_infra/
├── .github/
│   └── workflows/
│       ├── on_downstream_release.yml    # Release notifications
│       ├── platform_rebuild.yml         # Pinned compose generation
│       ├── dashboard_sync.yml           # Hourly dashboard refresh
│       ├── nightly_version_check.yml    # Daily PyPI validation
│       └── phase13_*.yml                # Deployment automation tests
│
├── ai_advisor/                          # 🆕 AI-powered policy tuning
│   ├── src/
│   │   └── main.py                      # FastAPI service
│   ├── Dockerfile
│   └── requirements.txt
│
├── build/
│   └── docker-compose.platform.yml      # Auto-generated pinned compose
│
├── cockpit/                             # 🆕 Enhanced control panel
│   ├── ui/
│   │   ├── auto_deploy.py               # Auto-deploy control
│   │   ├── policy_backups.py            # Policy backup/restore
│   │   ├── scheduler_control.py         # Scheduler management
│   │   ├── canary_rollouts.py           # Canary monitoring
│   │   ├── canary_insights.py           # Analytics dashboard
│   │   ├── slo_status.py                # SLO compliance
│   │   └── ai_suggestions.py            # AI recommendations
│   ├── app.py                           # Main Streamlit app
│   └── Dockerfile
│
├── configs/                             # 🆕 Declarative configuration
│   ├── deploy_policies.yaml             # Deployment rules
│   ├── deploy_scheduler.yaml            # Scheduler config
│   └── backups/                         # Versioned policy backups
│
├── docker/
│   ├── compose/                         # Service-specific compose files
│   └── initdb.d/                        # PostgreSQL init scripts
│
├── docs/
│   ├── AUTOMATION_INFRASTRUCTURE.md
│   ├── PHASE_4_CONTROL_PLANE.md
│   ├── PHASE_13_2_DEPLOY_GUIDE.md       # 🆕 Manual deployment
│   ├── PHASE_13_3_AUTO_DEPLOY_GUIDE.md  # 🆕 Auto-deploy setup
│   ├── PHASE_13_4_SCHEDULER_GUIDE.md    # 🆕 Scheduler config
│   ├── PHASE_13_5_CANARY_GUIDE.md       # 🆕 Canary rollouts
│   ├── PHASE_13_6_ADAPTIVE_GUIDE.md     # 🆕 Adaptive intelligence
│   ├── PHASE_13_7_SLO_GUIDE.md          # 🆕 SLO monitoring
│   ├── PHASE_13_8_AI_ADVISOR_GUIDE.md   # 🆕 AI tuning
│   ├── PLATFORM_OVERVIEW.md
│   ├── QUICK_REFERENCE.md
│   └── TROUBLESHOOTING.md
│
├── monitoring/
│   ├── grafana/
│   │   ├── dashboards/
│   │   └── provisioning/
│   └── prometheus/
│       ├── prometheus.yml
│       ├── alerts.yml
│       └── rules/                       # 🆕 Deployment alert rules
│           ├── alerts-deploy-canary.yml
│           ├── alerts-canary-intel.yml
│           ├── alerts-slo.yml
│           └── alerts-ai-advisor.yml
│
├── registry/
│   └── versions.json                    # Version registry (auto-updated)
│
├── scripts/
│   ├── deploy_latest.sh                 # 🆕 Deployment script
│   ├── rollback.sh                      # 🆕 Rollback script
│   ├── verify_stack.sh                  # 🆕 Health verification
│   ├── auto_deploy_test.sh              # 🆕 Webhook simulator
│   ├── simulate_registry_bump.sh        # 🆕 Registry test
│   ├── dashboard_generator.py
│   ├── registry_updater.py
│   └── health-check.sh
│
├── src/
│   └── infra_web/                       # 🆕 Deployment orchestration
│       ├── routes/
│       │   ├── deploy.py                # Manual deploy API
│       │   ├── deploy_webhook.py        # Webhook handler
│       │   ├── deploy_policies.py       # Policy management
│       │   ├── deploy_auto.py           # Scheduler control
│       │   ├── deploy_canary.py         # Canary API
│       │   └── ai.py                    # AI Advisor API
│       ├── services/
│       │   ├── deploy_controller.py     # Deploy logic
│       │   ├── webhook_controller.py    # Webhook validation
│       │   ├── autodeploy_worker.py     # Background worker
│       │   ├── auto_scheduler.py        # Registry scanner
│       │   ├── canary_engine.py         # Canary orchestration
│       │   ├── guard_evaluator.py       # Prometheus guards
│       │   └── ai_client.py             # AI Advisor client
│       ├── state/
│       │   ├── deploy_state.py          # Deployment state
│       │   └── scheduler_state.py       # Scheduler state
│       ├── main.py                      # FastAPI app
│       ├── Dockerfile
│       └── requirements.txt
│
├── state/                               # 🆕 Runtime state
│   └── last_good.json                   # Last-known-good versions
│
├── docker-compose.yml                   # Main orchestration
├── Makefile                             # Management commands
├── DASHBOARD.md                         # Auto-generated status
├── STACK_STATUS.md                      # Version registry
└── README.md                            # This file
```
Edit `configs/deploy_policies.yaml` to control deployment behavior:
```yaml
services:
  cockpit:
    repo: mjdevaccount/market_data_cockpit
    auto_deploy: true                 # Enable auto-deployment
    verify_url: http://cockpit:8505/health
    verify_wait_seconds: 30

    # SLO definitions
    slo:
      availability:
        query: 'sum(rate(...)) / sum(rate(...))'
        operator: '>='
        threshold: 0.995              # 99.5% uptime
      latency_p95:
        query: '1000 * histogram_quantile(...)'
        operator: '<='
        threshold: 600                # 600ms max

    slo_policy:
      max_violations: 3               # Disable after 3 SLO failures
      reinstate_after_passes: 3       # Re-enable after 3 successes

    # Canary rollout strategy
    rollout:
      strategy: canary
      steps:
        - pct: 0.1                    # Deploy to 10%
          wait_seconds: 60
          guards:
            - name: error_rate
              threshold: 0.02         # Loose threshold initially
        - pct: 0.5                    # Deploy to 50%
          wait_seconds: 120
          guards:
            - name: error_rate
              threshold: 0.015        # Tighten threshold
            - name: p95_latency_ms
              threshold: 600
        - pct: 1.0                    # Full deployment
          wait_seconds: 180
          guards:
            - name: error_rate
              threshold: 0.01         # Strictest threshold

    precheck_guards:
      - name: cluster_health
        query: 'up == 1'
        operator: '>='
        threshold: 0.95               # 95% of targets must be up
```

Edit `configs/deploy_scheduler.yaml`:
```yaml
interval_seconds: 60          # Scan interval
initial_delay_seconds: 5
max_per_hour: 6               # Rate limit per service
quiet_hours:
  start_hour_utc: 2           # 02:00 UTC
  end_hour_utc: 5             # 05:00 UTC
registry_file: "/app/registry/versions.json"
resolve_strategy: "service"   # "service" | "repo_short"
```

```bash
# Database
POSTGRES_DB=market_data
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_PORT=5433

# Registry
REGISTRY_URL=http://registry:8000
REGISTRY_TRACK=latest

# Deployment
GITHUB_WEBHOOK_SECRET=your_secret_here
WEBHOOK_ALLOWLIST=github.com,api.github.com

# AI Advisor
ADVISOR_MODE=heuristic           # "heuristic" | "llm"
OPENAI_API_KEY=sk_your_key       # For LLM mode
OLLAMA_URL=http://ollama:11434   # For local LLM

# Monitoring
PROMETHEUS_URL=http://prometheus:9090
GRAFANA_ADMIN_PASSWORD=admin
```

```text
# Deployment attempts
infra_deploy_attempts_total{service="cockpit"}

# Success rate
rate(infra_deploy_success_total[1h]) / rate(infra_deploy_attempts_total[1h])

# Rollback frequency
increase(infra_deploy_rollback_total[24h])

# Canary health
infra_canary_guard_fail_total{service="cockpit",guard="error_rate"}
infra_canary_current_step{service="cockpit"}
infra_canary_duration_seconds_bucket

# SLO compliance
infra_slo_health_ratio{service="cockpit",slo="availability"}
infra_slo_violation_total{service="cockpit"}

# Auto-deploy queue
infra_autodeploy_queue_depth
infra_autodeploy_success_total
infra_autodeploy_rollback_total

# AI Advisor
infra_ai_suggestions_total{service="cockpit"}
infra_ai_applied_total{service="cockpit",kind="threshold_adjust"}
infra_ai_rejected_total{service="cockpit"}

# Scheduler
infra_autosched_runs_total
infra_autosched_paused
infra_autosched_last_run_timestamp
```
Pre-configured dashboards are available at http://localhost:3000:
- Platform Overview - Service health, versions, uptime
- Deployment Intelligence - Canary metrics, rollout duration, success rates
- SLO Dashboard - SLO compliance, violation trends
- AI Advisor Metrics - Suggestion quality, application rate
- Prometheus Alerts - Active alerts, firing history
```bash
# Test manual deployment
curl -X POST http://localhost:8000/runtime/deploy/execute/cockpit/v1.0.0

# Test auto-deploy webhook
./scripts/auto_deploy_test.sh market_data_cockpit v1.2.3

# Simulate registry update
./scripts/simulate_registry_bump.sh /app/registry/versions.json cockpit v1.2.4

# Test canary rollout
curl -X POST http://localhost:8000/runtime/deploy/canary/start \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","tag":"v1.2.5"}'

# Test AI suggestions
curl -X POST http://localhost:8000/runtime/ai/scan
curl http://localhost:8000/runtime/ai/suggestions | jq
```

```bash
# Run complete health check
./scripts/verify_stack.sh

# Check deployment API
curl http://localhost:8000/health

# Check AI Advisor
curl http://localhost:8086/health

# Verify Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check metrics endpoint
curl http://localhost:8000/metrics | grep infra_
```

| Document | Description |
|---|---|
| README.md | This file - comprehensive overview |
| QUICK_REFERENCE.md | Quick command reference |
| PLATFORM_OVERVIEW.md | Architecture deep-dive |
| TROUBLESHOOTING.md | Common issues and solutions |
| Phase 13 Guides | Deployment Intelligence |
| PHASE_13_2_DEPLOY_GUIDE.md | Manual deployment setup |
| PHASE_13_3_AUTO_DEPLOY_GUIDE.md | Auto-deploy configuration |
| PHASE_13_4_SCHEDULER_GUIDE.md | Autonomous scheduler |
| PHASE_13_5_CANARY_GUIDE.md | Canary rollout strategies |
| PHASE_13_6_ADAPTIVE_GUIDE.md | Adaptive intelligence |
| PHASE_13_7_SLO_GUIDE.md | SLO-based monitoring |
| PHASE_13_8_AI_ADVISOR_GUIDE.md | AI-powered tuning |
| Control Plane | Platform Management |
| COCKPIT_QUICK_START.md | 5-minute cockpit setup |
| COCKPIT_ROLLOUT_GUIDE.md | Comprehensive deployment |
| AUTOMATION_INFRASTRUCTURE.md | Automation system details |
| Generated | Auto-Updated |
| DASHBOARD.md | Live platform status |
| STACK_STATUS.md | Version registry |
We welcome contributions! This is an actively developed project with continuous enhancements.
1. Fork and clone:

   ```bash
   git clone https://github.com/YOUR_USERNAME/market_data_infra.git
   cd market_data_infra
   ```

2. Create a feature branch:

   ```bash
   git checkout -b feature/phase-13.x-your-feature
   ```

3. Make changes and test:

   ```bash
   # Start platform
   docker compose --profile infra up -d

   # Test your changes
   curl http://localhost:8000/your-endpoint

   # Check logs
   docker logs infra_web
   ```

4. Run validation:

   ```bash
   # Health checks
   ./scripts/verify_stack.sh

   # Linting (if applicable)
   cd src/infra_web && pylint **/*.py
   ```

5. Commit with conventional commits:

   ```bash
   git add .
   git commit -m "feat(phase13.x): add your feature description"
   ```

6. Push and create a PR:

   ```bash
   git push origin feature/phase-13.x-your-feature
   ```

We're particularly interested in contributions for:
- 🤖 LLM Integration - OpenAI/Ollama for AI Advisor
- ☸️ Kubernetes Deployment - Helm charts, operators
- 🔐 Security Enhancements - RBAC, secrets management
- 📊 Additional Metrics - Custom exporters, dashboards
- 🧪 Testing - Integration tests, chaos engineering
- 📚 Documentation - Guides, tutorials, examples
- Python: Follow PEP 8, use type hints
- YAML: 2-space indentation
- Shell: Use `shellcheck` for validation
- Commits: Conventional commits (`feat:`, `fix:`, `docs:`, etc.)
Problem: Canary rollout stuck in "waiting" state
Solution:
```bash
# Check guard evaluation
curl http://localhost:8000/runtime/deploy/canary/status | jq '.rollouts.cockpit.guard_results'

# View Prometheus metrics
curl 'http://localhost:9090/api/v1/query?query=infra_canary_current_step{service="cockpit"}'

# Force abort if needed
curl -X POST http://localhost:8000/runtime/deploy/canary/abort/cockpit
```

Problem: SLO violations causing auto-disable
Solution:
```bash
# Check SLO status
curl http://localhost:8000/metrics | grep infra_slo_health_ratio

# Review violation history
curl 'http://localhost:9090/api/v1/query?query=increase(infra_slo_violation_total[1h])'

# Manually re-enable after fixing
# Edit configs/deploy_policies.yaml, set auto_deploy: true
```

Problem: AI Advisor not generating suggestions
Solution:
```bash
# Check AI Advisor health
curl http://localhost:8086/health

# Verify Prometheus connectivity
docker exec ai_advisor curl http://prometheus:9090/-/healthy

# Review AI Advisor logs
docker logs ai_advisor

# Test with specific service
curl -X POST http://localhost:8086/ai/evaluate \
  -H "Content-Type: application/json" \
  -d '{"service":"cockpit","current_policy":{...}}'
```

For comprehensive troubleshooting, see TROUBLESHOOTING.md.
This project is licensed under the MIT License - see the LICENSE file for details.
- TimescaleDB - Time-series database excellence
- Prometheus - Metrics collection and alerting
- Grafana - Beautiful visualizations
- Streamlit - Rapid dashboard development
- FastAPI - Modern Python web framework
- Docker - Containerization platform
- GitHub Actions - CI/CD automation
- NumPy - Scientific computing for AI Advisor
Current Phase: 13.8 - AI-Powered Deployment Intelligence
Status: ✅ PRODUCTION READY
Last Updated: October 25, 2025
- Services: 12 (6 core + 4 infrastructure + 2 intelligence)
- API Endpoints: 50+ across deployment lifecycle
- Automation: 8+ GitHub Actions workflows
- Documentation: 20+ comprehensive guides
- Prometheus Metrics: 50+ custom metrics
- Alert Rules: 25+ proactive monitoring rules
- Test Coverage: Health checks, smoke tests, validation scripts
- Deployment Strategies: Manual, auto-deploy, canary, SLO-based
- ✅ Phase 1-3: Infrastructure & Platform Integration
- ✅ Phase 4: Control Plane & Cockpit
- ✅ Phase 13.2: Runtime Deploy & Self-Update
- ✅ Phase 13.3: Auto-Deploy & Webhook Integration
- ✅ Phase 13.4: Autonomous Scheduler
- ✅ Phase 13.5: Canary Rollouts
- ✅ Phase 13.6: Adaptive Canary Intelligence
- ✅ Phase 13.7: SLO-Based Rollback & Auto-Reinstate
- ✅ Phase 13.8: AI Advisor Service
- 🚧 Phase 14: Cloud-Native Deployment (Kubernetes, Helm)
- 📋 Phase 15: Multi-Region & HA
October 2025 - Deployment Intelligence Revolution:
- Runtime Deploy System (13.2)
  - Manual deployment API
  - Health verification & rollback
  - Prometheus metrics integration
  - State management
- Auto-Deploy Engine (13.3)
  - GitHub webhook integration
  - Policy-driven automation
  - Threaded deployment queue
  - Automatic rollback on failure
- Autonomous Scheduler (13.4)
  - Background registry scanner
  - Rate limiting & quiet hours
  - Live control API
  - Real-time status dashboard
- Canary Rollouts (13.5)
  - Step-wise progressive deployment
  - PromQL guard evaluation
  - Automatic rollback on guard failure
  - Cockpit integration
- Adaptive Intelligence (13.6)
  - Precheck guard system
  - Step-specific threshold tuning
  - Auto-disable on repeated failures
  - Historical insights dashboard
- SLO Monitoring (13.7)
  - Continuous SLO evaluation
  - Auto-disable on violations
  - Auto-reinstate on recovery
  - Real-time compliance tracking
- AI Advisor (13.8)
  - Heuristic threshold analysis
  - LLM-ready architecture
  - Prometheus-driven insights
  - Human-in-the-loop approval
Languages:
- Python 3.11+
- Bash
- YAML
Frameworks:
- FastAPI (API services)
- Streamlit (UI dashboards)
- Prometheus (metrics)
- Docker Compose (orchestration)
Databases:
- PostgreSQL 15
- TimescaleDB 2.20
- Redis 7.x
Monitoring:
- Prometheus 2.x
- Grafana 10.x
- Loki 2.x
AI/ML:
- NumPy (data analysis)
- OpenAI API (future)
- Ollama (future)
- Infrastructure Hub: market_data_infra ⭐
- Platform Services:
- Documentation: docs/
- API Reference: http://localhost:8000/docs (when running)
- Metrics: http://localhost:8000/metrics
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
Built with ❤️ for production-grade, AI-powered market data infrastructure
🚀 Self-Healing • 🤖 AI-Optimized • 📊 Fully Observable • ☸️ Cloud-Ready