- Overview
- System Architecture
- Network Infrastructure
- Cluster Components
- Performance Evaluation
- Troubleshooting
- Contributors
- Acknowledgements
- References
Pi_Cluster-ECE_NTUA_2024-2025 documents the full setup, configuration, and automation of a High Performance Computing (HPC) cluster built using Raspberry Pi boards by students of the School of Electrical and Computer Engineering, NTUA (2024–2025).
The goal is to create a scalable, educational, and fully functional HPC environment using low-cost hardware.
This environment enables research and learning in parallel computing, distributed systems, and cluster automation.
- Centralized user and resource management (NIS + NFS)
- Automated provisioning via Ansible
- Stateless PXE network boot for worker nodes
- Job scheduling with SLURM and MPI integration
- Real-time performance monitoring with Prometheus & Grafana
The cluster consists of 18 Raspberry Pi 4 units:
- 1 Login Node – User entry point, OS image management
- 1 Master Node – Orchestration, NFS, NIS, SLURM, Monitoring
- 16 Worker Nodes – Compute units divided into two groups:
  - red1–red8
  - blue1–blue8
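To make the red/blue grouping concrete, the small loop below prints hosts-file style entries for the sixteen workers. The 10.0.0.0/24 subnet and the host numbering are illustrative assumptions only, not the cluster's actual address plan.

```bash
# Hypothetical addressing sketch: prints /etc/hosts-style entries for the two
# worker groups. Subnet and host numbers are placeholders, not real values.
for i in $(seq 1 8); do
  printf '10.0.0.%d\tred%d\n'  "$((10 + i))" "$i"
  printf '10.0.0.%d\tblue%d\n' "$((20 + i))" "$i"
done
```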
- Operating System: Ubuntu 24.04
- Networking: Static IP configuration over Gigabit Ethernet
- Boot Method: PXE (diskless network boot)
- Storage: NFS shared filesystem
- User Management: Centralized with NIS
- Job Scheduling: SLURM + Munge + MariaDB + MPI
- Monitoring: Prometheus + Grafana + Node Exporter
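As an illustration of the static-IP configuration on Ubuntu 24.04, a worker's Netplan file might look like the sketch below. The interface name, addresses, gateway, and file name are assumptions for illustration, not the cluster's real values.

```bash
# Minimal Netplan sketch for a statically addressed node (Ubuntu 24.04).
# Interface name, addresses, and gateway are placeholders.
sudo tee /etc/netplan/01-cluster.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [10.0.0.11/24]
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1]
EOF
sudo netplan apply
```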
The nodes are interconnected through a Ubiquiti Managed Layer 2 Switch, forming the high-speed backbone of the cluster.
| Specification | Details |
|---|---|
| Layer | L2 Managed |
| Switching Speed | 88 Gbps |
| Ports | 48 × 1 Gbps Ethernet + 2 × SFP |
| PoE Support | 802.3af / 802.3at (PoE+) |
| Rack Mountable | Yes |
| Purpose | Provides Gigabit connectivity and PoE power for all nodes |
This switch ensures stable communication, traffic prioritization, and remote management across the entire HPC network.
Each subsystem is modular and documented in a dedicated folder.
| Subsystem | Description | Documentation |
|---|---|---|
| PXE (Netboot) | Enables diskless network boot for all worker nodes | PXE Setup |
| NFS | Provides shared filesystem for home directories and datasets | NFS Setup |
| NIS | Centralized user authentication and identity service | NIS Setup |
| SLURM | Resource and job scheduler integrating MPI and Munge | SLURM Setup |
| Monitoring | Prometheus + Grafana + Node Exporter metrics stack | Monitoring Setup |
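To give a flavour of what one of these subsystems involves, a minimal NFS export-and-mount sequence could look like the following; the exported path, the subnet, and the hostname `master` are assumptions, and the full procedure lives in the NFS Setup documentation.

```bash
# Hedged NFS sketch: export a shared /home from the master, mount it on a worker.

# On the master (NFS server):
echo "/home 10.0.0.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On a worker (NFS client):
sudo mount -t nfs master:/home /home
```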
To assess performance and scalability, the cluster runs the NAS Parallel Benchmarks (NPB) suite from NASA.
These benchmarks measure:
- Computation throughput — overall processing speed across CPU cores.
- Inter-node communication performance — network latency and bandwidth effects on distributed workloads (MPI).
- Scalability and parallel efficiency — how execution time and speedup change as we increase nodes and cores.
- Resource and thermal behavior — effects such as throttling or memory limits impacting sustained performance.
Detailed results, plots, and interpretation are available in the Benchmark Results Documentation.
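For orientation, a SLURM submission for one of the NPB MPI kernels might look like the sketch below. The benchmark class, node and task counts, and binary path are assumptions and should be adapted to the cluster's actual NPB build.

```bash
#!/bin/bash
# Illustrative sbatch script for an NPB MPI kernel (CG, class B).
# Node/task counts and the binary path are assumptions, not the documented setup.
#SBATCH --job-name=npb-cg
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --output=npb-cg-%j.out

srun ./NPB3.4-MPI/bin/cg.B.x   # launch the CG kernel across 16 MPI ranks
```

Submitted with `sbatch`, a script like this lets SLURM place the 16 MPI ranks across four workers, which is the pattern used to study scaling across node counts.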
Most of the cluster setup and maintenance workflow is orchestrated through Ansible, ensuring consistent configuration and minimal manual intervention. Using a centralized inventory, the master node can simultaneously deploy updates, install packages, and modify configurations across all 16 worker nodes.
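The centralized inventory mentioned above could group the nodes roughly as in the sketch below; the group names, host ranges, and the file name `inventory.ini` are assumptions rather than the project's actual files.

```bash
# Hypothetical inventory layout; Ansible expands red[1:8] to red1..red8.
cat > inventory.ini <<'EOF'
[workers]
red[1:8]
blue[1:8]

[login]
login

[master]
master
EOF
```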
Automation covers almost every major subsystem of the cluster, including:
- NFS & NIS Configuration: Seamless setup of shared storage and centralized authentication.
- SLURM Deployment: Automatic installation of SLURM controller and compute daemons with Munge authentication.
- Monitoring Stack: Installation and configuration of Prometheus, Node Exporters, and Grafana dashboards.
All playbooks, inventory files, and configuration templates are documented in the corresponding folders.
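A typical round of maintenance from the master node might then look like the following; the inventory and playbook file names are assumptions, and the real playbooks live in the folders referenced above.

```bash
# Hedged examples of day-to-day Ansible usage from the master node.
ansible workers -i inventory.ini -m ping                              # reachability check
ansible workers -i inventory.ini -b -m apt -a "update_cache=yes upgrade=dist"
ansible-playbook -i inventory.ini slurm.yml                           # e.g. (re)deploy SLURM daemons
```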
All known issues, diagnostic steps, and recovery procedures are centralized in a dedicated guide covering PXE, NFS, NIS, SLURM, and monitoring components.
Read the Troubleshooting Guide
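As a quick first pass before diving into the guide, a few standard health checks are sketched below; the service and host names assume a stock SLURM/NFS/NIS installation like the one described in this README.

```bash
# First-pass health checks (service and host names assume a stock install).
sinfo                                        # SLURM: node and partition states
scontrol show node red1                      # SLURM: details for a single worker
systemctl status slurmctld slurmdbd          # on the master node
systemctl status slurmd                      # on a worker node
showmount -e master                          # NFS: exports visible from a worker
ypwhich                                      # NIS: which server the client is bound to
journalctl -u slurmd --since "1 hour ago"    # recent daemon logs
```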
- Nikolas Spyropoulos
- Giannis Polychronopoulos
- Nikolaos Tsalkitzis
- Nikolaos Mpeligiannis
- George Kouseris
This project was developed by students of ECE, NTUA (2024–2025) within the framework of Parallel Processing Systems and HPC research.
Special thanks to:
- CSLab, NTUA, for technical guidance
- Raspberry Pi Foundation for the open hardware platform
- The open-source communities of SLURM, Ansible, and Prometheus
You may find the following links useful:
- Git Repository of a Similar Project
- Article on SLURM
- Another Article on SLURM
- PDF of HPC Cluster Documentation
- PDF Slides (Probably not Useful)
- Article on NFS
- Article on the Cluster Setup
- Another Article on the Cluster
- SLURM Documentation
- Useful YouTube Video on SLURM
- SLURM Installation Tutorial Repository
- SLURM Configuration Generator Tool