
Backend.AI


Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPUs, ROCm GPUs, Rebellions, FuriosaAI, HyperAccel, Google TPU, Graphcore IPU, and other NPUs.

It allocates and isolates the underlying computing resources for multi-tenant computation sessions, on demand or in batches, using customizable job schedulers built on its own orchestrator, "Sokovan".

All its functions are exposed as REST and GraphQL APIs.
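
For example, once a client keypair is configured (see Getting Started below), the bundled CLI exercises these APIs directly. A minimal sketch against a local development setup; subcommand spellings may vary across releases:

# Show the resolved API endpoint and credentials
./backend.ai config

# List your compute sessions via the manager API
./backend.ai ps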

Requirements

Python & Build Tools

  • Python (the development setup provisions a virtual environment automatically)
  • Pantsbuild (the unified build system, configured by pants.toml)

Infrastructure

Required:

  • Docker 20.10+ (with Compose v2)
  • PostgreSQL 16+ (tested with 16.3)
  • Redis 7.2+ (tested with 7.2.11)
  • etcd 3.5+ (tested with 3.5.14)
  • Prometheus 3.x (tested with 3.1.0)

Recommended (for observability):

  • Grafana 11.x (tested with 11.4.0)
  • Loki 3.x (tested with 3.5.0)
  • Tempo 2.x (tested with 2.7.2)
  • OpenTelemetry Collector

→ Detailed infrastructure setup: Infrastructure Documentation
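
If you prefer to bring up this infrastructure yourself instead of using the install script below, the repository ships per-version halfstack configs (docker-compose.*.yml). A sketch, assuming the compose file name in your checkout:

# Start the halfstack services (PostgreSQL, Redis, etcd, etc.) in the background
docker compose -f docker-compose.halfstack-main.yml up -d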

System

  • OS: Linux (Debian/RHEL-based) or macOS
  • Permissions: sudo access for installation
  • Resources: 4+ CPU cores, 8GB+ RAM recommended for development

Getting Started

Quick Start (Development)

1. Clone and Install

git clone https://github.com/lablup/backend.ai.git
cd backend.ai
./scripts/install-dev.sh

This script will:

  • Check required dependencies (Docker, Python, etc.)
  • Set up Python virtual environment with Pantsbuild
  • Start halfstack infrastructure (PostgreSQL, Redis, etcd, Grafana, etc.)
  • Initialize database schemas
  • Create default API keypairs and user accounts

2. Start Backend.AI Services

Start each component in a separate terminal:

Manager (Terminal 1):

./backend.ai mgr start-server --debug

Agent (Terminal 2):

./backend.ai ag start-server --debug

Storage Proxy (Terminal 3):

./py -m ai.backend.storage.server

Web Server (Terminal 4):

./py -m ai.backend.web.server

App Proxy (Terminal 5-6, optional for in-container service access):

./backend.ai app-proxy-coordinator start-server --debug
./backend.ai app-proxy-worker start-server --debug

3. Run Your First Session

Set up client environment:

source env-local-user-session.sh

Run a simple Python session:

./backend.ai run python -c "print('Hello Backend.AI!')"

Or access the Web UI at http://localhost:8090 using the credentials from the env-local-*.sh files.

Accessing Compute Sessions (aka Kernels)

Backend.AI provides websocket tunneling into individual computation sessions (containers), so that users can securely access in-container applications from their browsers or the client CLI (see the example after the list below).

  • Jupyter: data scientists' favorite tool
    • Most container images have intrinsic Jupyter and JupyterLab support.
  • Web-based terminal
    • All container sessions have intrinsic ttyd support.
  • SSH
    • All container sessions have intrinsic SSH/SFTP/SCP support with an auto-generated per-user SSH keypair. PyCharm and other IDEs can attach to on-demand sessions via SSH remote interpreters.
  • VSCode
    • Most container sessions have intrinsic web-based VSCode support.
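
As an example, the client CLI can open an SSH shell or tunnel an in-container app to your local machine. A sketch, assuming a running session named mysess; the exact subcommand and flag spellings may differ across client versions:

# Open an SSH shell into the session using the auto-generated keypair
./backend.ai ssh mysess

# Forward the in-container Jupyter app to a local port (flag spelling is illustrative)
./backend.ai app -b 8080 mysess jupyter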

Working with Storage

Backend.AI provides an abstraction layer on top of existing network-based storage (e.g., NFS/SMB), called vfolders (virtual folders). Each vfolder works like cloud storage: it can be mounted into any computation session and shared between users and user groups with differentiated privileges.
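
For instance, a typical vfolder workflow with the client CLI looks like the following sketch, using illustrative names (flag and subcommand spellings may differ by version; vfolders are conventionally mounted under /home/work/<name> inside the session, and vfolder create may also take an explicit storage host argument):

# Create a vfolder, upload data into it, and mount it into a session
./backend.ai vfolder create mydata
./backend.ai vfolder upload mydata ./dataset.csv
./backend.ai run -m mydata python -c "import os; print(os.listdir('/home/work/mydata'))"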

Installation for Multi-node Tests & Production

Please consult our documentation for community-supported materials. Contact the sales team (contact@lablup.com) for professional paid support and deployment options.

Architecture

For comprehensive system architecture, component interactions, and infrastructure details, see:

Component Architecture Documentation

This document covers:

  • System architecture diagrams and component flow
  • Port numbers and infrastructure setup
  • Component dependencies and communication protocols
  • Development and production environment configuration

Contents in This Repository

This repository contains all open-source server-side components and the client SDK for Python as a reference implementation of API clients.

Directory Structure

  • src/ai/backend/: Source code
    • manager/: Manager as the cluster control-plane
    • manager/api: Manager API handlers
    • account_manager/: Unified user profile and SSO management
    • agent/: Agent as per-node controller
    • agent/docker/: Agent's Docker backend
    • agent/k8s/: Agent's Kubernetes backend
    • agent/dummy/: Agent's dummy backend
    • kernel/: Agent's kernel runner counterpart
    • runner/: Agent's in-kernel prebuilt binaries
    • helpers/: Agent's in-kernel helper package
    • common/: Shared utilities
    • client/: Client SDK
    • cli/: Unified CLI for all components
    • install/: SCIE-based TUI installer
    • storage/: Storage proxy for offloading storage operations
    • storage/api: Storage proxy's manager-facing and client-facing APIs
    • appproxy/: App proxy for accessing container apps from outside
    • appproxy/coordinator: App proxy coordinator, which provisions routing circuits
    • appproxy/worker: App proxy worker, which forwards the traffic
    • web/: Web UI server
      • static/: Backend.AI WebUI release artifacts
    • logging/: Logging subsystem
    • plugin/: Plugin subsystem
    • test/: Integration test suite
    • testutils/: Shared utilities used by unit tests
    • meta/: Legacy meta package
    • accelerator/: Intrinsic accelerator plugins
  • docs/: Unified documentation
  • tests/
    • manager/, agent/, ...: Per-component unit tests
  • configs/
    • manager/, agent/, ...: Per-component sample configurations
  • docker/: Dockerfiles for auxiliary containers
  • fixtures/
    • manager/, ...: Per-component fixtures for development setup and tests
  • plugins/: A directory to place plugins such as accelerators, monitors, etc.
  • scripts/: Scripts to assist development workflows
    • install-dev.sh: The single-node development setup script from the working copy
  • stubs/: Type annotation stub packages written by us
  • tools/: A directory to host Pants-related tooling
  • dist/: A directory to put build artifacts (.whl files) and Pants-exported virtualenvs
  • changes/: News fragments for towncrier
  • pants.toml: The Pants configuration
  • pyproject.toml: Tooling configuration (towncrier, pytest, mypy)
  • BUILD: The root build config file
  • **/BUILD: Per-directory build config files
  • BUILD_ROOT: An indicator to mark the build root directory for Pants
  • CLAUDE.md: The steering guide for agent-assisted development
  • requirements.txt: The unified requirements file
  • *.lock, tools/*.lock: The dependency lock files
  • docker-compose.*.yml: Per-version recommended halfstack container configs
  • README.md: This file
  • MIGRATION.md: The migration guide for updating between major releases
  • VERSION: The unified version declaration

Server-side components are licensed under LGPLv3 to promote non-proprietary open innovation in the open-source community, while other shared libraries and client SDKs are distributed under the MIT license.

There is no obligation to open the source code of your own services or systems if you run the server-side components as-is (e.g., running them as daemons or importing them without modification). Please contact us (contact@lablup.com) for commercial consulting and more licensing details/options for individual use cases.

Major Components

Backend.AI consists of the following core components:

Server-Side Components

Manager - Central API gateway and orchestrator

  • Routes REST/GraphQL requests and orchestrates cluster operations
  • Session scheduling via Sokovan orchestrator
  • User authentication and RBAC authorization
  • Plugin interfaces: backendai_scheduler_v10, backendai_agentselector_v10, backendai_hook_v20, backendai_webapp_v20, backendai_monitor_stats_v10, backendai_monitor_error_v10
  • Legacy repo: https://github.com/lablup/backend.ai-manager

Agent - Kernel lifecycle management on compute nodes

  • Manages Docker containers (kernels) on individual nodes
  • Self-registers to cluster via heartbeats
  • Plugin interfaces: backendai_accelerator_v21, backendai_monitor_stats_v10, backendai_monitor_error_v10
  • Legacy repo: https://github.com/lablup/backend.ai-agent

Storage Proxy - Virtual folder and storage backend abstraction

Webserver - Web UI hosting and session management

Synchronizing the static Backend.AI WebUI version:

scripts/download-webui-release.sh <target version to download>

App Proxy - Service routing and load balancing

  • Routes traffic to in-container services (Jupyter, VSCode, etc.)
  • Dynamic circuit provisioning and health monitoring

Container Runtime Components

Kernels - Container image recipes

  • Dockerfile-based computing environment recipes
  • Support for popular ML frameworks and programming languages

Jail - Programmable sandbox (Rust)

  • ptrace-based system call filtering
  • Resource control and security enforcement

Hook - In-container runtime library

  • libc overrides for resource control
  • Web-based interactive stdin support

Client SDK Libraries

We offer client SDKs in popular programming languages (MIT License):

  • Python - pip install backend.ai-client | GitHub | Includes CLI
  • Java - Releases
  • JavaScript - npm install backend.ai-client | GitHub
  • PHP - (under preparation) composer require lablup/backend.ai-client | GitHub
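
For example, installing the Python SDK from PyPI also provides the backend.ai command-line interface:

# Install the Python client SDK (bundles the CLI)
pip install backend.ai-client
backend.ai --help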

Plugins

Backend.AI supports plugin-based extensibility via Python package entrypoints:

Accelerator Plugins (backendai_accelerator_v21)

  • CUDA - NVIDIA GPU support
  • CUDA Mock - Development without actual GPUs
  • ROCm - AMD GPU support
  • More available in the enterprise edition
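
Because plugins are discovered through Python package entry points, enabling one amounts to installing it into the same environment as the agent. A sketch using the mock plugin; the package name is illustrative:

# Install an accelerator plugin so that its backendai_accelerator_v21
# entry point becomes discoverable by the agent at startup
./py -m pip install backend.ai-accelerator-cuda-mock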

Monitoring Plugins

Legacy Components

Media Library - Multi-media output support (no longer maintained)

IDE Extensions - (Deprecated: Use in-kernel Jupyter Lab, VSCode Server, or SSH instead)

Development

Building Packages

Build Python wheels or SCIE (Self-Contained Installable Executable) packages:

./scripts/build-wheels.sh  # Build .whl packages
./scripts/build-scies.sh   # Build SCIE packages

Packages are placed in the dist/ directory.

Code Quality Hooks

Backend.AI uses Git pre-commit hooks to maintain code quality:

# Automatically runs on every commit:
# - Linting (pants lint)
# - Type checking (pants check)

# Bypass hooks if needed (use sparingly)
git commit --no-verify

The pre-commit hook validates:

  • Code style and formatting
  • Type annotations

Tests run in CI for comprehensive coverage.
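
You can also run the same Pants goals manually at any time (assuming the pants launcher is available in your environment):

# Lint, type-check, and test the whole repository
pants lint ::
pants check ::
pants test ::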

See CLAUDE.md for detailed hook system documentation.

Development Guide

For detailed development setup, build system usage, and contribution guidelines, see the unified documentation under docs/.

License

Refer to the LICENSE file.
