ColonyOS is an open-source framework for seamless execution of computational workloads across heterogeneous platforms - cloud, edge, HPC, IoT devices, and beyond. It creates Compute Continuums by providing a unified orchestration layer that operates as a meta-orchestrator on top of existing infrastructure.
Traditional orchestration systems are tied to specific platforms (Kubernetes for cloud, Slurm for HPC, etc.). ColonyOS breaks these silos through meta-process management - a broker-based architecture that separates computational intent from execution.
Example use cases:
- Scientific Computing: Process satellite imagery, analyze sensor data, run simulations across HPC clusters
- AI/ML Pipelines: Distribute training jobs, run inference on edge devices, orchestrate multi-agent LLM systems
- Serverless at Scale: Build FaaS platforms that span cloud, edge, and on-premise infrastructure
- Data Processing: ETL pipelines, batch processing, real-time stream processing with ColonyFS integration
- Industrial IoT: Coordinate computations across factory floor devices, edge gateways, and cloud
- Earth Observation: Automated satellite image processing and analysis workflows
- Infrastructure as Code: Declaratively manage infrastructure across computing continuums - define resources spanning cloud, edge, HPC, and IoT with GitOps workflows, automatic drift detection, and self-healing reconciliation
Declarative Intent + Broker + Distributed Execution = Computing Continuums
Instead of writing platform-specific code, you declare WHAT you want to compute using Function Specifications. The Colonies Server acts as a broker that matches your intent with available Executors (distributed workers) that know HOW to execute on their specific platforms. This separation creates seamless Computing Continuums across heterogeneous infrastructure.
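The WHAT/HOW split can be made concrete with a small Python sketch. The dictionary fields (`funcname`, `args`, `conditions`, `executortype`) are illustrative assumptions modeled on the concepts above, not the exact ColonyOS wire format; consult the SDK documentation for the real schema.

```python
# A hypothetical function specification: pure intent, no platform details.
# Field names are illustrative assumptions, not the exact ColonyOS schema.
func_spec = {
    "funcname": "process_image",
    "args": ["s3://bucket/scene-42.tif"],
    "conditions": {"executortype": "hpc-executor"},  # who may run it
}

# Executors advertise what they can run; names here are made up.
executors = [
    {"name": "edge-cam-1", "executortype": "edge-executor"},
    {"name": "lumi-partition", "executortype": "hpc-executor"},
]

def match(spec, executors):
    """The broker's job, reduced to its essence: return executors
    whose type satisfies the spec's conditions."""
    wanted = spec["conditions"]["executortype"]
    return [e for e in executors if e["executortype"] == wanted]

print([e["name"] for e in match(func_spec, executors)])  # ['lumi-partition']
```

The executor that wins the match supplies the platform-specific HOW; the spec never mentions Kubernetes, Slurm, or any other backend.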
- Platform Agnostic: Same function specification runs on Kubernetes, HPC, edge devices, IoT - executors translate to platform-specific execution
- Decoupled Architecture: Submit work anytime, execute asynchronously - temporal and spatial decoupling via broker
- Zero-Trust by Design: No session tokens, no passwords - every request cryptographically signed with Ed25519
- Protocol Flexibility: Choose HTTP/REST, gRPC, CoAP (IoT), or LibP2P (P2P) - or run them all simultaneously
- Pull-Based Execution: Executors connect from anywhere (even behind NAT/firewalls) and pull work - no need for inbound access
- Built-in Audit Trail: Every execution recorded as an immutable ledger for compliance and debugging
- Real-Time Reactive: WebSocket subscriptions for instant notifications on workflow state changes
- Multi-Protocol Architecture: Native support for HTTP/REST, gRPC, CoAP (IoT), and LibP2P (peer-to-peer)
- Distributed Execution: Executors run anywhere on the Internet - supercomputers, edge devices, browsers, embedded systems
- Zero-Trust Security: All communication cryptographically signed with Ed25519
- Workflow DAGs: Complex computational pipelines with parent-child dependencies
- Event-Driven: Real-time WebSocket subscriptions for process state changes
- Scheduled Execution: Cron-based and interval-based job scheduling
- Dynamic Batching: Generators that pack arguments and trigger workflows based on counter or timeout conditions
- Resource Reconciliation: Kubernetes-style declarative resource management with automatic drift detection and correction
- Full Audit Trail: Complete execution history stored as an immutable ledger
- High Availability: Etcd-based clustering with automatic failover
- Multi-Language SDKs: Go, Rust, Python, Julia, JavaScript, Haskell
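The pull-based, temporally decoupled model from the list above can be pictured with a plain queue: work is submitted now, and executors call outward and take it later, so the server never needs inbound access to them. This is a toy illustration, not the ColonyOS API.

```python
import queue

# Toy broker: a queue of waiting processes. In ColonyOS the Colonies
# server plays this role; queue.Queue stands in purely for illustration.
waiting = queue.Queue()

# Submission and execution are decoupled in time: submit now...
for i in range(3):
    waiting.put({"process_id": i, "funcname": "simulate"})

def executor_loop(pull_timeout=0.1):
    """...and an executor (possibly behind NAT, connecting outward)
    pulls and runs work later."""
    completed = []
    while True:
        try:
            proc = waiting.get(timeout=pull_timeout)  # outbound pull, no inbound port
        except queue.Empty:
            break  # no work left
        proc["state"] = "SUCCESS"  # pretend to run the function
        completed.append(proc)
    return completed

done = executor_loop()
print(len(done))  # 3
```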
- Colony: A distributed runtime environment - a network of loosely connected Executors
- Executor: Distributed worker that pulls and executes workloads (can be implemented in any language, runs anywhere)
- Process: Computational workload with states: WAITING → RUNNING → SUCCESS/FAILED
- FunctionSpec: Specification defining what computation to run and execution conditions
- ProcessGraph: Workflow represented as a Directed Acyclic Graph (DAG)
- Resource: Declarative infrastructure specification with desired state management
- Reconciliation: Automatic drift detection and correction that maintains resources in their desired state
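Reconciliation as defined above can be pictured as a loop that diffs desired state against observed state and corrects any drift. A minimal sketch, assuming nothing about the real resource schema (the fields below are invented for illustration):

```python
# Desired state, as a declarative spec (illustrative fields only).
desired = {"replicas": 3, "image": "colonies:v1.2"}

# Observed state has drifted: one replica lost, image rolled back.
observed = {"replicas": 2, "image": "colonies:v1.1"}

def reconcile(desired, observed):
    """Detect drift, then self-heal by restoring every drifted field
    to its desired value. Returns what drifted as {field: (was, want)}."""
    drift = {k: (observed.get(k), v) for k, v in desired.items()
             if observed.get(k) != v}
    for key, (_, want) in drift.items():
        observed[key] = want  # corrective action
    return drift

drift = reconcile(desired, observed)
print(sorted(drift))        # ['image', 'replicas']
print(observed == desired)  # True
```

A real reconciler runs this compare-and-correct cycle continuously, which is what keeps resources at their desired state without manual intervention.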
- Submit: Users submit function specifications to the Colonies server
- Schedule: The scheduler assigns processes to available Executors based on conditions
- Execute: Executors pull assigned processes, execute them, and report results
- Chain: Complex workflows span multiple platforms by chaining processes together
- Monitor: Real-time subscriptions and full execution history enable observability
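The Chain step corresponds to a ProcessGraph: a DAG of parent-child dependencies in which a process becomes runnable only after all of its parents have succeeded. A toy topological walk over an invented four-step workflow:

```python
from collections import deque

# Parent -> children edges of a small workflow DAG (illustrative names):
# fetch fans out to two parallel steps that are joined by a report step.
children = {"fetch": ["clean", "features"], "clean": ["report"],
            "features": ["report"], "report": []}

def run_order(children):
    """Kahn's algorithm: a process runs only once all its parents are done."""
    parents_left = {n: 0 for n in children}
    for kids in children.values():
        for k in kids:
            parents_left[k] += 1
    ready = deque(n for n, c in parents_left.items() if c == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)  # "execute" the process
        for kid in children[node]:
            parents_left[kid] -= 1
            if parents_left[kid] == 0:  # all parents succeeded -> child is ready
                ready.append(kid)
    return order

print(run_order(children))  # ['fetch', 'clean', 'features', 'report']
```

In ColonyOS, each node of such a graph can land on a different executor, which is how a single workflow spans multiple platforms.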
Colonies implements a zero-trust architecture where all communication is cryptographically signed:
- No traditional authentication tokens or session management
- Each request signed with Ed25519 private keys
- Server validates signatures and enforces role-based access control
- Executors can operate on untrusted infrastructure while maintaining security
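The shape of this sign-every-request flow can be sketched in a few lines. One loud caveat: ColonyOS uses Ed25519 public-key signatures, which the Python standard library does not provide, so HMAC-SHA256 stands in below purely to show the protocol pattern (sign on send, verify on receive, no sessions), not the actual cryptography.

```python
import hashlib
import hmac
import json

# Stand-in key. In ColonyOS this would be an Ed25519 private key;
# HMAC is used here only because the stdlib lacks Ed25519.
key = b"executor-private-key"

def sign_request(payload: dict, key: bytes) -> dict:
    """Attach a signature so the server can verify the sender
    without tokens or session state."""
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

def verify_request(msg: dict, key: bytes) -> bool:
    """Recompute and compare in constant time; any tampering fails."""
    body = json.dumps(msg["payload"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["signature"])

msg = sign_request({"method": "assignprocess", "colony": "demo"}, key)
print(verify_request(msg, key))   # True
msg["payload"]["colony"] = "tampered"
print(verify_request(msg, key))   # False
```

With real Ed25519, verification needs only the public key, which is what lets executors run on untrusted infrastructure while the server still authenticates every request.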
Run Colonies server with any combination of protocols:
| Backend | Use Case | Port |
|---|---|---|
| HTTP/REST | Web APIs, dashboards, traditional clients | 8080 |
| gRPC | High-performance, low-latency communication | 50051 |
| CoAP | IoT devices, constrained environments | 5683 |
| LibP2P | Peer-to-peer, decentralized, NAT traversal | 4001 |
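The port column above can be read as per-backend defaults when parsing a comma-separated selection like the `COLONIES_SERVER_BACKENDS` variable shown below. This parse-and-validate sketch is an illustrative assumption, not the server's actual startup code; only the backend names and ports come from the table.

```python
import os

# Default ports taken from the backends table; the parsing logic
# itself is an illustrative assumption.
DEFAULT_PORTS = {"http": 8080, "grpc": 50051, "coap": 5683, "libp2p": 4001}

def parse_backends(value: str) -> dict:
    """Turn 'http,grpc,libp2p' into {backend: port}, rejecting unknown names."""
    chosen = {}
    for name in filter(None, (p.strip().lower() for p in value.split(","))):
        if name not in DEFAULT_PORTS:
            raise ValueError(f"unknown backend: {name}")
        chosen[name] = DEFAULT_PORTS[name]
    return chosen

value = os.environ.get("COLONIES_SERVER_BACKENDS", "http,grpc,libp2p")
print(parse_backends(value))
```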
Configure via environment variable:
```shell
export COLONIES_SERVER_BACKENDS="http,grpc,libp2p"  # Run multiple protocols simultaneously
```

Comprehensive step-by-step tutorials are available in the tutorials repository:
The Colonies Dashboard provides a web UI for monitoring and managing your compute continuum:
- Installation Guide - Install and configure Colonies
- Getting Started - Your first Colonies application
- Configuration - Environment variables and settings
- Backend Configuration - HTTP, gRPC, CoAP, LibP2P setup
- Introduction - Core concepts and architecture
- Implementing Executors - Create executors in Python, Go, Julia, JavaScript
- Fibonacci Tutorial (Go) - Complete example application
- Workflow DAGs - Create complex computational pipelines
- Generators - Batch processing and dynamic workflows
- Cron Jobs - Schedule recurring tasks
- CLI Usage - Command-line interface reference
- Logging - Process logging and monitoring
- Overall Design - System architecture and design patterns
- RPC Protocol - HTTP RPC protocol specification
- Security Design - Zero-trust security model
- Container Building - Build Docker containers for single and multi-platform
- High-Availability Deployment - Production cluster setup
- Monitoring - Grafana and Prometheus integration
- Kubernetes Helm Charts - Deploy on Kubernetes
- Go SDK - Official Go client library
- Python SDK - Python client library
- Rust SDK - Rust client library
- Julia SDK - Julia client library
- JavaScript SDK - JavaScript/Node.js library
- Haskell SDK - Haskell client library
- Executors - Pre-built executor implementations
```shell
make build                    # Build the main colonies binary
make container                # Build Docker container for local architecture
make container-multiplatform  # Build for amd64 and arm64
make install                  # Install to /usr/local/bin
```

For detailed instructions on building containers, including multi-platform builds, see the Container Building Guide.
```shell
make test         # Run all tests
make github_test  # Run tests for CI (no color output)

# Test specific backends
COLONIES_BACKEND_TYPE=gin make test
COLONIES_BACKEND_TYPE=grpc make test
COLONIES_BACKEND_TYPE=libp2p make test

make coverage     # Generate coverage reports
```

ColonyOS is currently used in production by:
- RockSigma AB - Automatic seismic processing engine for underground mines, orchestrating workloads across cloud and edge infrastructure
Contributions are welcome! Please see our contributing guidelines and code of conduct.
- Website: colonyos.io
- GitHub: github.com/colonyos
- Tutorials: github.com/colonyos/tutorials
See LICENSE file for details.