ColonyOS is an open-source framework for seamless execution of computational workloads across heterogeneous platforms - cloud, edge, HPC, IoT devices, and beyond. It creates Compute Continuums by providing a unified orchestration layer that operates as a meta-orchestrator on top of existing infrastructure.
Traditional orchestration systems are tied to specific platforms (Kubernetes for cloud, Slurm for HPC, etc.). ColonyOS breaks these silos through meta-process management - a broker-based architecture that separates computational intent from execution.
Example use cases:
- Scientific Computing: Process satellite imagery, analyze sensor data, run simulations across HPC clusters
- AI/ML Pipelines: Distribute training jobs, run inference on edge devices, orchestrate multi-agent LLM systems
- Serverless at Scale: Build FaaS platforms that span cloud, edge, and on-premise infrastructure
- Data Processing: ETL pipelines, batch processing, real-time stream processing with ColonyFS integration
- Industrial IoT: Coordinate computations across factory floor devices, edge gateways, and cloud
- Earth Observation: Automated satellite image processing and analysis workflows
- Infrastructure as Code: Declaratively manage infrastructure across computing continuums - define resources spanning cloud, edge, HPC, and IoT with GitOps workflows, automatic drift detection, and self-healing reconciliation
Declarative Intent + Broker + Distributed Execution = Computing Continuums
Instead of writing platform-specific code, you declare WHAT you want to compute using Function Specifications. The Colonies Server acts as a broker that matches your intent with available Executors (distributed workers) that know HOW to execute on their specific platforms. This separation creates seamless Computing Continuums across heterogeneous infrastructure.
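The WHAT/HOW split can be made concrete with a small Python sketch. The dictionary fields (`funcname`, `args`, `conditions`, `executortype`) are illustrative assumptions modeled on the concepts above, not the exact ColonyOS wire format; consult the SDK documentation for the real schema.

```python
# A hypothetical function specification: pure intent, no platform details.
# Field names are illustrative assumptions, not the exact ColonyOS schema.
func_spec = {
    "funcname": "process_image",
    "args": ["s3://bucket/scene-42.tif"],
    "conditions": {"executortype": "hpc-executor"},  # who may run it
}

# Executors advertise what they can run; names here are made up.
executors = [
    {"name": "edge-cam-1", "executortype": "edge-executor"},
    {"name": "lumi-partition", "executortype": "hpc-executor"},
]

def match(spec, executors):
    """The broker's job, reduced to its essence: return executors
    whose type satisfies the spec's conditions."""
    wanted = spec["conditions"]["executortype"]
    return [e for e in executors if e["executortype"] == wanted]

print([e["name"] for e in match(func_spec, executors)])  # ['lumi-partition']
```

The executor that wins the match supplies the platform-specific HOW; the spec never mentions Kubernetes, Slurm, or any other backend.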
- Platform Agnostic: Same function specification runs on Kubernetes, HPC, edge devices, IoT - executors translate to platform-specific execution
- Decoupled Architecture: Submit work anytime, execute asynchronously - temporal and spatial decoupling via broker
- Zero-Trust by Design: No session tokens, no passwords - every request cryptographically signed with Ed25519
- Protocol Flexibility: Choose HTTP/REST, gRPC, CoAP (IoT), or LibP2P (P2P) - or run them all simultaneously
- Pull-Based Execution: Executors connect from anywhere (even behind NAT/firewalls) and pull work - no need for inbound access
- Built-in Audit Trail: Every execution recorded as an immutable ledger for compliance and debugging
- Real-Time Reactive: WebSocket subscriptions for instant notifications on workflow state changes
- Multi-Protocol Architecture: Native support for HTTP/REST, gRPC, CoAP (IoT), and LibP2P (peer-to-peer)
- Distributed Execution: Executors run anywhere on the Internet - supercomputers, edge devices, browsers, embedded systems
- Zero-Trust Security: All communication cryptographically signed with Ed25519
- Workflow DAGs: Complex computational pipelines with parent-child dependencies
- Event-Driven: Real-time WebSocket subscriptions for process state changes
- Scheduled Execution: Cron-based and interval-based job scheduling
- Dynamic Batching: Generators that pack arguments and trigger workflows based on counter or timeout conditions
- Resource Reconciliation: Kubernetes-style declarative resource management with automatic drift detection and correction
- Full Audit Trail: Complete execution history stored as an immutable ledger
- High Availability: Etcd-based clustering with automatic failover
- Multi-Language SDKs: Go, Rust, Python, Julia, JavaScript, Haskell
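The pull-based, temporally decoupled model from the list above can be pictured with a plain queue: work is submitted now, and executors call outward and take it later, so the server never needs inbound access to them. This is a toy illustration, not the ColonyOS API.

```python
import queue

# Toy broker: a queue of waiting processes. In ColonyOS the Colonies
# server plays this role; queue.Queue stands in purely for illustration.
waiting = queue.Queue()

# Submission and execution are decoupled in time: submit now...
for i in range(3):
    waiting.put({"process_id": i, "funcname": "simulate"})

def executor_loop(pull_timeout=0.1):
    """...and an executor (possibly behind NAT, connecting outward)
    pulls and runs work later."""
    completed = []
    while True:
        try:
            proc = waiting.get(timeout=pull_timeout)  # outbound pull, no inbound port
        except queue.Empty:
            break  # no work left
        proc["state"] = "SUCCESS"  # pretend to run the function
        completed.append(proc)
    return completed

done = executor_loop()
print(len(done))  # 3
```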
- Colony: A distributed runtime environment - a network of loosely connected Executors
- Executor: Distributed worker that pulls and executes workloads (can be implemented in any language, runs anywhere)
- Process: Computational workload with states: WAITING → RUNNING → SUCCESS/FAILED
- FunctionSpec: Specification defining what computation to run and execution conditions
- ProcessGraph: Workflow represented as a Directed Acyclic Graph (DAG)
- Resource: Declarative infrastructure specification with desired state management
- Reconciliation: Automatic drift detection and correction that maintains resources in their desired state
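Reconciliation as defined above can be pictured as a loop that diffs desired state against observed state and corrects any drift. A minimal sketch, assuming nothing about the real resource schema (the fields below are invented for illustration):

```python
# Desired state, as a declarative spec (illustrative fields only).
desired = {"replicas": 3, "image": "colonies:v1.2"}

# Observed state has drifted: one replica lost, image rolled back.
observed = {"replicas": 2, "image": "colonies:v1.1"}

def reconcile(desired, observed):
    """Detect drift, then self-heal by restoring every drifted field
    to its desired value. Returns what drifted as {field: (was, want)}."""
    drift = {k: (observed.get(k), v) for k, v in desired.items()
             if observed.get(k) != v}
    for key, (_, want) in drift.items():
        observed[key] = want  # corrective action
    return drift

drift = reconcile(desired, observed)
print(sorted(drift))        # ['image', 'replicas']
print(observed == desired)  # True
```

A real reconciler runs this compare-and-correct cycle continuously, which is what keeps resources at their desired state without manual intervention.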
- Submit: Users submit function specifications to the Colonies server
- Schedule: The scheduler assigns processes to available Executors based on conditions
- Execute: Executors pull assigned processes, execute them, and report results
- Chain: Complex workflows span multiple platforms by chaining processes together
- Monitor: Real-time subscriptions and full execution history enable observability
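The Chain step corresponds to a ProcessGraph: a DAG of parent-child dependencies in which a process becomes runnable only after all of its parents have succeeded. A toy topological walk over an invented four-step workflow:

```python
from collections import deque

# Parent -> children edges of a small workflow DAG (illustrative names):
# fetch fans out to two parallel steps that are joined by a report step.
children = {"fetch": ["clean", "features"], "clean": ["report"],
            "features": ["report"], "report": []}

def run_order(children):
    """Kahn's algorithm: a process runs only once all its parents are done."""
    parents_left = {n: 0 for n in children}
    for kids in children.values():
        for k in kids:
            parents_left[k] += 1
    ready = deque(n for n, c in parents_left.items() if c == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)  # "execute" the process
        for kid in children[node]:
            parents_left[kid] -= 1
            if parents_left[kid] == 0:  # all parents succeeded -> child is ready
                ready.append(kid)
    return order

print(run_order(children))  # ['fetch', 'clean', 'features', 'report']
```

In ColonyOS, each node of such a graph can land on a different executor, which is how a single workflow spans multiple platforms.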
Colonies implements a zero-trust architecture where all communication is cryptographically signed:
- No traditional authentication tokens or session management
- Each request signed with Ed25519 private keys
- Server validates signatures and enforces role-based access control
- Executors can operate on untrusted infrastructure while maintaining security
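The shape of this sign-every-request flow can be sketched in a few lines. One loud caveat: ColonyOS uses Ed25519 public-key signatures, which the Python standard library does not provide, so HMAC-SHA256 stands in below purely to show the protocol pattern (sign on send, verify on receive, no sessions), not the actual cryptography.

```python
import hashlib
import hmac
import json

# Stand-in key. In ColonyOS this would be an Ed25519 private key;
# HMAC is used here only because the stdlib lacks Ed25519.
key = b"executor-private-key"

def sign_request(payload: dict, key: bytes) -> dict:
    """Attach a signature so the server can verify the sender
    without tokens or session state."""
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

def verify_request(msg: dict, key: bytes) -> bool:
    """Recompute and compare in constant time; any tampering fails."""
    body = json.dumps(msg["payload"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["signature"])

msg = sign_request({"method": "assignprocess", "colony": "demo"}, key)
print(verify_request(msg, key))   # True
msg["payload"]["colony"] = "tampered"
print(verify_request(msg, key))   # False
```

With real Ed25519, verification needs only the public key, which is what lets executors run on untrusted infrastructure while the server still authenticates every request.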
Run Colonies server with any combination of protocols:
| Backend | Use Case | Port |
|---|---|---|
| HTTP/REST | Web APIs, dashboards, traditional clients | 8080 |
| gRPC | High-performance, low-latency communication | 50051 |
| CoAP | IoT devices, constrained environments | 5683 |
| LibP2P | Peer-to-peer, decentralized, NAT traversal | 4001 |
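The port column above can be read as per-backend defaults when parsing a comma-separated selection like the `COLONIES_SERVER_BACKENDS` variable shown below. This parse-and-validate sketch is an illustrative assumption, not the server's actual startup code; only the backend names and ports come from the table.

```python
import os

# Default ports taken from the backends table; the parsing logic
# itself is an illustrative assumption.
DEFAULT_PORTS = {"http": 8080, "grpc": 50051, "coap": 5683, "libp2p": 4001}

def parse_backends(value: str) -> dict:
    """Turn 'http,grpc,libp2p' into {backend: port}, rejecting unknown names."""
    chosen = {}
    for name in filter(None, (p.strip().lower() for p in value.split(","))):
        if name not in DEFAULT_PORTS:
            raise ValueError(f"unknown backend: {name}")
        chosen[name] = DEFAULT_PORTS[name]
    return chosen

value = os.environ.get("COLONIES_SERVER_BACKENDS", "http,grpc,libp2p")
print(parse_backends(value))
```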
Configure via environment variable:
```shell
export COLONIES_SERVER_BACKENDS="http,grpc,libp2p"  # Run multiple protocols simultaneously
```

Comprehensive step-by-step tutorials are available in the tutorials repository:
The Colonies Dashboard provides a web UI for monitoring and managing your compute continuum:
- Installation Guide - Install and configure Colonies
- Getting Started - Your first Colonies application
- Configuration - Environment variables and settings
- Backend Configuration - HTTP, gRPC, CoAP, LibP2P setup
- Introduction - Core concepts and architecture
- Implementing Executors - Create executors in Python, Go, Julia, JavaScript
- Fibonacci Tutorial (Go) - Complete example application
- Workflow DAGs - Create complex computational pipelines
- Generators - Batch processing and dynamic workflows
- Cron Jobs - Schedule recurring tasks
- CLI Usage - Command-line interface reference
- Logging - Process logging and monitoring
- Overall Design - System architecture and design patterns
- RPC Protocol - HTTP RPC protocol specification
- Security Design - Zero-trust security model
- Container Building - Build Docker containers for single and multi-platform
- High-Availability Deployment - Production cluster setup
- Monitoring - Grafana and Prometheus integration
- Kubernetes Helm Charts - Deploy on Kubernetes
- Go SDK - Official Go client library
- Python SDK - Python client library
- Rust SDK - Rust client library
- Julia SDK - Julia client library
- JavaScript SDK - JavaScript/Node.js library
- Haskell SDK - Haskell client library
- Executors - Pre-built executor implementations
```shell
make build                    # Build the main colonies binary
make container                # Build Docker container for local architecture
make container-multiplatform  # Build for amd64 and arm64
make install                  # Install to /usr/local/bin
```

For detailed instructions on building containers, including multi-platform builds, see the Container Building Guide.
```shell
make test         # Run all tests
make github_test  # Run tests for CI (no color output)

# Test specific backends
COLONIES_BACKEND_TYPE=gin make test
COLONIES_BACKEND_TYPE=grpc make test
COLONIES_BACKEND_TYPE=libp2p make test

make coverage     # Generate coverage reports
```

ColonyOS is currently used in production by:
- RockSigma AB - Automatic seismic processing engine for underground mines, orchestrating workloads across cloud and edge infrastructure
Contributions are welcome! Please see our contributing guidelines and code of conduct.
- Website: colonyos.io
- GitHub: github.com/colonyos
- Tutorials: github.com/colonyos/tutorials
See LICENSE file for details.