ColonyOS - Distributed Meta-Orchestrator

ColonyOS is an open-source framework for seamless execution of computational workloads across heterogeneous platforms - cloud, edge, HPC, IoT devices, and beyond. It creates Compute Continuums by providing a unified orchestration layer that operates as a meta-orchestrator on top of existing infrastructure.

Why ColonyOS?

Traditional orchestration systems are tied to specific platforms (Kubernetes for cloud, Slurm for HPC, etc.). ColonyOS breaks these silos through meta-process management - a broker-based architecture that separates computational intent from execution.

Example use cases:

  • Scientific Computing: Process satellite imagery, analyze sensor data, run simulations across HPC clusters
  • AI/ML Pipelines: Distribute training jobs, run inference on edge devices, orchestrate multi-agent LLM systems
  • Serverless at Scale: Build FaaS platforms that span cloud, edge, and on-premise infrastructure
  • Data Processing: ETL pipelines, batch processing, real-time stream processing with ColonyFS integration
  • Industrial IoT: Coordinate computations across factory floor devices, edge gateways, and cloud
  • Earth Observation: Automated satellite image processing and analysis workflows
  • Infrastructure as Code: Declaratively manage infrastructure across computing continuums - define resources spanning cloud, edge, HPC, and IoT with GitOps workflows, automatic drift detection, and self-healing reconciliation

The Core Idea

Declarative Intent + Broker + Distributed Execution = Computing Continuums

Instead of writing platform-specific code, you declare WHAT you want to compute using Function Specifications. The Colonies Server acts as a broker that matches your intent with available Executors (distributed workers) that know HOW to execute on their specific platforms. This separation creates seamless Computing Continuums across heterogeneous infrastructure.
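To make the separation of intent and execution concrete, here is a minimal sketch of the matching step. The `FunctionSpec` and `Executor` types, their fields, and the `Match` function are hypothetical simplifications for illustration; they are not the actual ColonyOS data model or API, and real matching presumably considers more conditions than the executor type.

```go
package main

import "fmt"

// FunctionSpec is a hypothetical, simplified stand-in for a function
// specification: WHAT to run, plus a condition on who may run it.
type FunctionSpec struct {
	FuncName     string
	Args         []string
	ExecutorType string // assumed condition: the kind of executor required
}

// Executor is a hypothetical worker that knows HOW to run work on one platform.
type Executor struct {
	Name string
	Type string
}

// Match returns the executors whose type satisfies the spec's condition.
func Match(spec FunctionSpec, executors []Executor) []Executor {
	var out []Executor
	for _, e := range executors {
		if e.Type == spec.ExecutorType {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	executors := []Executor{
		{Name: "k8s-1", Type: "kubernetes"},
		{Name: "hpc-1", Type: "slurm"},
		{Name: "edge-1", Type: "edge"},
	}
	spec := FunctionSpec{
		FuncName:     "process_image",
		Args:         []string{"tile-42"},
		ExecutorType: "slurm",
	}
	for _, e := range Match(spec, executors) {
		fmt.Printf("%s can run %s\n", e.Name, spec.FuncName) // hpc-1 can run process_image
	}
}
```

The submitter never names a machine or a platform API; it only states a condition, and the broker decides which registered executors qualify.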

Key Advantages

  • Platform Agnostic: The same function specification runs on Kubernetes, HPC, edge devices, and IoT - executors translate it into platform-specific execution
  • Decoupled Architecture: Submit work anytime, execute asynchronously - temporal and spatial decoupling via the broker
  • Zero-Trust by Design: No session tokens, no passwords - every request cryptographically signed with Ed25519
  • Protocol Flexibility: Choose HTTP/REST, gRPC, CoAP (IoT), or LibP2P (P2P) - or run them all simultaneously
  • Pull-Based Execution: Executors connect from anywhere (even behind NAT/firewalls) and pull work - no need for inbound access
  • Built-in Audit Trail: Every execution recorded as an immutable ledger for compliance and debugging
  • Real-Time Reactive: WebSocket subscriptions for instant notifications on workflow state changes

Key Features

  • Multi-Protocol Architecture: Native support for HTTP/REST, gRPC, CoAP (IoT), and LibP2P (peer-to-peer)
  • Distributed Execution: Executors run anywhere on the Internet - supercomputers, edge devices, browsers, embedded systems
  • Zero-Trust Security: All communication cryptographically signed with Ed25519
  • Workflow DAGs: Complex computational pipelines with parent-child dependencies
  • Event-Driven: Real-time WebSocket subscriptions for process state changes
  • Scheduled Execution: Cron-based and interval-based job scheduling
  • Dynamic Batching: Generators that pack arguments and trigger workflows based on counter or timeout conditions
  • Resource Reconciliation: Kubernetes-style declarative resource management with automatic drift detection and correction
  • Full Audit Trail: Complete execution history stored as an immutable ledger
  • High Availability: Etcd-based clustering with automatic failover
  • Multi-Language SDKs: Go, Rust, Python, Julia, JavaScript, Haskell
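As an illustration of the workflow DAG feature, the sketch below derives a valid execution order from parent-child dependencies using Kahn's topological sort. The `ExecutionOrder` function and the map-based graph representation are assumptions made for this example; they do not reflect how the Colonies server stores or schedules a ProcessGraph.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ExecutionOrder returns an order in which every node runs after all of its
// parents. The input maps each node name to the names of its parents.
func ExecutionOrder(parents map[string][]string) ([]string, error) {
	indegree := map[string]int{}      // number of unfinished parents per node
	children := map[string][]string{} // reverse edges: parent -> children
	for node, deps := range parents {
		if _, ok := indegree[node]; !ok {
			indegree[node] = 0
		}
		for _, dep := range deps {
			if _, ok := indegree[dep]; !ok {
				indegree[dep] = 0
			}
			indegree[node]++
			children[dep] = append(children[dep], node)
		}
	}

	var ready []string
	for node, d := range indegree {
		if d == 0 {
			ready = append(ready, node)
		}
	}
	sort.Strings(ready) // sort so the result is deterministic

	var order []string
	for len(ready) > 0 {
		node := ready[0]
		ready = ready[1:]
		order = append(order, node)
		for _, child := range children[node] {
			indegree[child]--
			if indegree[child] == 0 {
				ready = append(ready, child)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(indegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := ExecutionOrder(map[string][]string{
		"download":   {},
		"preprocess": {"download"},
		"train":      {"preprocess"},
		"report":     {"train", "preprocess"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(order) // [download preprocess train report]
}
```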

Architecture

Core Concepts

  • Colony: A distributed runtime environment - a network of loosely connected Executors
  • Executor: Distributed worker that pulls and executes workloads (can be implemented in any language, runs anywhere)
  • Process: Computational workload with states: WAITING → RUNNING → SUCCESS/FAILED
  • FunctionSpec: Specification defining what computation to run and execution conditions
  • ProcessGraph: Workflow represented as a Directed Acyclic Graph (DAG)
  • Resource: Declarative infrastructure specification with desired state management
  • Reconciliation: Automatic drift detection and correction that maintains resources in their desired state
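The process lifecycle listed above can be sketched as a small state machine. This is based only on the WAITING → RUNNING → SUCCESS/FAILED line; the `State` type and `CanTransition` function are hypothetical, and the real server may permit additional transitions (for example, re-queuing a process) that are not modeled here.

```go
package main

import "fmt"

// State is a hypothetical type naming the process states listed above.
type State string

const (
	Waiting State = "WAITING"
	Running State = "RUNNING"
	Success State = "SUCCESS"
	Failed  State = "FAILED"
)

// allowed lists the transitions implied by WAITING -> RUNNING -> SUCCESS/FAILED.
// SUCCESS and FAILED have no entries, so they are terminal in this sketch.
var allowed = map[State][]State{
	Waiting: {Running},
	Running: {Success, Failed},
}

// CanTransition reports whether a process may move from one state to another.
func CanTransition(from, to State) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Waiting, Running)) // true
	fmt.Println(CanTransition(Waiting, Success)) // false: must run first
	fmt.Println(CanTransition(Success, Running)) // false: terminal state
}
```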

How It Works

  1. Submit: Users submit function specifications to the Colonies server
  2. Schedule: The scheduler assigns processes to available Executors based on conditions
  3. Execute: Executors pull assigned processes, execute them, and report results
  4. Chain: Complex workflows span multiple platforms by chaining processes together
  5. Monitor: Real-time subscriptions and full execution history enable observability
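The submit, schedule, and execute steps above can be sketched as an in-memory broker that executors pull from. Everything here (`Broker`, `Process`, `Submit`, `Assign`, `Close`, `ErrNoWork`) is a hypothetical illustration rather than the Colonies API, and the sketch is single-threaded, with no persistence, signing, or networking.

```go
package main

import (
	"errors"
	"fmt"
)

// Process is a hypothetical, simplified record of one submitted workload.
type Process struct {
	ID           int
	FuncName     string
	ExecutorType string
	State        string
	AssignedTo   string
}

// Broker holds submitted processes until an executor pulls them.
type Broker struct {
	nextID int
	queue  []*Process
}

// ErrNoWork is returned when nothing is waiting for the given executor type.
var ErrNoWork = errors.New("no waiting process for this executor type")

// Submit registers intent; nothing executes until an executor asks for work.
func (b *Broker) Submit(funcName, executorType string) *Process {
	b.nextID++
	p := &Process{ID: b.nextID, FuncName: funcName, ExecutorType: executorType, State: "WAITING"}
	b.queue = append(b.queue, p)
	return p
}

// Assign is called by an executor (pull model) and hands it the oldest
// waiting process that matches its type.
func (b *Broker) Assign(executorName, executorType string) (*Process, error) {
	for _, p := range b.queue {
		if p.State == "WAITING" && p.ExecutorType == executorType {
			p.State = "RUNNING"
			p.AssignedTo = executorName
			return p, nil
		}
	}
	return nil, ErrNoWork
}

// Close records the outcome reported by the executor.
func (b *Broker) Close(p *Process, ok bool) {
	if ok {
		p.State = "SUCCESS"
	} else {
		p.State = "FAILED"
	}
}

func main() {
	b := &Broker{}
	b.Submit("process_image", "edge")

	p, err := b.Assign("edge-1", "edge") // the executor initiates the request
	if err != nil {
		panic(err)
	}
	b.Close(p, true)
	fmt.Println(p.ID, p.AssignedTo, p.State) // 1 edge-1 SUCCESS
}
```

Because the executor always initiates the exchange, it needs only outbound connectivity, which is what lets executors run behind NAT or firewalls.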

Zero-Trust Security Model

Colonies implements a zero-trust architecture where all communication is cryptographically signed:

  • No traditional authentication tokens or session management
  • Each request signed with Ed25519 private keys
  • Server validates signatures and enforces role-based access control
  • Executors can operate on untrusted infrastructure while maintaining security
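A minimal sketch of per-request signing with Go's standard `crypto/ed25519` package follows. The helper names (`NewKeyPair`, `SignRequest`, `VerifyRequest`) and the JSON payload are assumptions for illustration; the actual Colonies message format, key encoding, and identity derivation are not shown in this document.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// NewKeyPair generates a fresh Ed25519 key pair for a user or executor.
func NewKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// SignRequest signs the raw request payload with the caller's private key.
// The private key never leaves the caller; only the signature is sent.
func SignRequest(priv ed25519.PrivateKey, payload []byte) []byte {
	return ed25519.Sign(priv, payload)
}

// VerifyRequest is what a server would do on every request: check that the
// signature matches the payload under the sender's known public key.
func VerifyRequest(pub ed25519.PublicKey, payload, sig []byte) bool {
	return ed25519.Verify(pub, payload, sig)
}

func main() {
	pub, priv, err := NewKeyPair()
	if err != nil {
		panic(err)
	}
	payload := []byte(`{"funcname":"process_image"}`) // hypothetical request body
	sig := SignRequest(priv, payload)

	fmt.Println("valid:", VerifyRequest(pub, payload, sig))                          // valid: true
	fmt.Println("tampered:", VerifyRequest(pub, []byte(`{"funcname":"other"}`), sig)) // tampered: false
}
```

Since each request carries its own signature, there is no session state to steal or expire, which matches the "no tokens, no sessions" points above.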

Multi-Backend Support

Run the Colonies server with any combination of protocols:

| Backend   | Use Case                                    | Port  |
|-----------|---------------------------------------------|-------|
| HTTP/REST | Web APIs, dashboards, traditional clients   | 8080  |
| gRPC      | High-performance, low-latency communication | 50051 |
| CoAP      | IoT devices, constrained environments       | 5683  |
| LibP2P    | Peer-to-peer, decentralized, NAT traversal  | 4001  |

Configure via environment variable:

export COLONIES_SERVER_BACKENDS="http,grpc,libp2p"  # Run multiple protocols simultaneously
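For illustration, a comma-separated value like the one above could be parsed as follows. This is a sketch, not the server's actual parsing code: `ParseBackends` is a hypothetical helper, the names `http`, `grpc`, and `libp2p` are taken from the example above, and `coap` is an assumed identifier for the CoAP backend.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// knownBackends lists accepted names. "coap" is an assumption; the other
// three appear in the example export line.
var knownBackends = map[string]bool{
	"http":   true,
	"grpc":   true,
	"coap":   true,
	"libp2p": true,
}

// ParseBackends splits a comma-separated list, normalizes case and
// whitespace, drops duplicates, and rejects unknown names.
func ParseBackends(value string) ([]string, error) {
	var out []string
	seen := map[string]bool{}
	for _, part := range strings.Split(value, ",") {
		name := strings.ToLower(strings.TrimSpace(part))
		if name == "" {
			continue
		}
		if !knownBackends[name] {
			return nil, fmt.Errorf("unknown backend %q", name)
		}
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no backends configured")
	}
	return out, nil
}

func main() {
	backends, err := ParseBackends(os.Getenv("COLONIES_SERVER_BACKENDS"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("enabled backends:", backends)
}
```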

Tutorials

Comprehensive step-by-step tutorials are available in the tutorials repository.

Dashboard

The Colonies Dashboard provides a web UI for monitoring and managing your compute continuum.

Documentation

Getting Started

Guides

Architecture & Design

Deployment

SDKs & Tools

Development

Building

make build              # Build the main colonies binary
make container          # Build Docker container for local architecture
make container-multiplatform  # Build for amd64 and arm64
make install            # Install to /usr/local/bin

For detailed instructions on building containers including multi-platform builds, see the Container Building Guide.

Testing

make test              # Run all tests
make github_test       # Run tests for CI (no color output)

# Test specific backends
COLONIES_BACKEND_TYPE=gin make test
COLONIES_BACKEND_TYPE=grpc make test
COLONIES_BACKEND_TYPE=libp2p make test

Code Coverage

make coverage         # Generate coverage reports

Production Usage

ColonyOS is currently used in production by:

  • RockSigma AB - Automatic seismic processing engine for underground mines, orchestrating workloads across cloud and edge infrastructure

Contributing

Contributions are welcome! Please see our contributing guidelines and code of conduct.

Community

License

See LICENSE file for details.

About

Colonies is a distributed framework to implement a meta-operating system.
