HPC_Cluster-ECE_NTUA_2024-2025



Table of Contents

  1. Overview
  2. System Architecture
  3. Network Infrastructure
  4. Cluster Components
  5. Performance Evaluation
  6. Automation
  7. Troubleshooting
  8. Contributors
  9. Acknowledgements
  10. References

Overview

Pi_Cluster-ECE_NTUA_2024-2025 documents the full setup, configuration, and automation of a High Performance Computing (HPC) cluster built using Raspberry Pi boards by students of the School of Electrical and Computer Engineering, NTUA (2024–2025).

The goal is to create a scalable, educational, and fully functional HPC environment using low-cost hardware.
This environment enables research and learning in parallel computing, distributed systems, and cluster automation.

Key Capabilities

  • Centralized user and resource management (NIS + NFS)
  • Automated provisioning via Ansible
  • Stateless PXE network boot for worker nodes
  • Job scheduling with SLURM and MPI integration
  • Real-time performance monitoring with Prometheus & Grafana
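
For example, a typical user session on such a cluster, compiling an MPI program in the NFS-shared home directory and launching it through SLURM, could look like the sketch below. This is a minimal illustration assuming Open MPI and a working SLURM PMI/PMIx integration; the file name and the node/rank counts are arbitrary placeholders.

```bash
# Toy MPI program written into the NFS-shared home directory (placeholder name).
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF

# Compile once; the binary is visible on every worker through NFS.
mpicc -O2 -o hello_mpi hello_mpi.c

# Launch 16 ranks across 4 workers through the scheduler.
srun --nodes=4 --ntasks=16 ./hello_mpi
```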

System Architecture

The cluster consists of 18 Raspberry Pi 4 units:

  • 1 Login Node – User entry point, OS image management
  • 1 Master Node – Orchestration, NFS, NIS, SLURM, Monitoring
  • 16 Worker Nodes – Compute units divided into two groups:
    • red1–red8
    • blue1–blue8

Operating System: Ubuntu 24.04
Networking: Static IP configuration over Gigabit Ethernet
Boot Method: PXE (diskless network boot)
Storage: NFS shared filesystem
User Management: Centralized with NIS
Job Scheduling: SLURM + Munge + MariaDB + MPI
Monitoring: Prometheus + Grafana + Node Exporter
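
To make the red/blue grouping concrete, the sketch below shows how the worker nodes could be declared as SLURM nodes and partitions on the master. The hostnames and the 4-core CPU count follow the architecture above; the configuration path, memory value, and partition layout are illustrative assumptions rather than the cluster's actual slurm.conf.

```bash
# Hypothetical slurm.conf excerpt (path may differ per installation).
sudo tee -a /etc/slurm/slurm.conf >/dev/null <<'EOF'
NodeName=red[1-8]  CPUs=4 RealMemory=3800 State=UNKNOWN
NodeName=blue[1-8] CPUs=4 RealMemory=3800 State=UNKNOWN
PartitionName=red  Nodes=red[1-8]           MaxTime=INFINITE State=UP
PartitionName=blue Nodes=blue[1-8]          MaxTime=INFINITE State=UP
PartitionName=all  Nodes=red[1-8],blue[1-8] Default=YES MaxTime=INFINITE State=UP
EOF

# Ask the controller to re-read its configuration.
sudo scontrol reconfigure
```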

Diagram

[Figure: HPC cluster architecture]


Network Infrastructure

The nodes are interconnected through a Ubiquiti Managed Layer 2 Switch, forming the high-speed backbone of the cluster.

Switch specifications:

  • Layer: L2 Managed
  • Switching speed: 88 Gbps
  • Ports: 48 × 1 Gbps Ethernet + 2 × SFP
  • PoE support: 802.3af / 802.3at (PoE+)
  • Rack mountable: Yes
  • Purpose: Provides Gigabit connectivity and PoE power for all nodes

This switch ensures stable communication, traffic prioritization, and remote management across the entire HPC network.
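
Since the nodes use static addressing on Ubuntu 24.04, per-node addresses are typically expressed as netplan configuration; on the PXE-booted workers this would live inside the shared boot image or be handled by DHCP reservations in the PXE setup. The snippet below is only a sketch for a single node: the interface name, subnet, gateway, and DNS server are placeholder assumptions, not the cluster's actual addressing plan.

```bash
# Hypothetical static-IP netplan file for one node (addresses are placeholders).
sudo tee /etc/netplan/01-cluster.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [10.0.0.11/24]   # e.g. red1; each node gets a unique address
      routes:
        - to: default
          via: 10.0.0.1           # gateway (assumption)
      nameservers:
        addresses: [10.0.0.1]
EOF
sudo netplan apply
```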


Cluster Components

Each subsystem is modular and documented in a dedicated folder.

  • PXE (Netboot): enables diskless network boot for all worker nodes (see PXE Setup)
  • NFS: provides the shared filesystem for home directories and datasets (see NFS Setup)
  • NIS: centralized user authentication and identity service (see NIS Setup)
  • SLURM: resource and job scheduler integrating MPI and Munge (see SLURM Setup)
  • Monitoring: Prometheus + Grafana + Node Exporter metrics stack (see Monitoring Setup)
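
Once these pieces are in place, each subsystem can be sanity-checked with a few standard client commands. The sketch below assumes the master node is reachable as "master" and that node_exporter runs on its default port 9100; the hostnames are placeholders.

```bash
# NFS: list the filesystems exported by the master node.
showmount -e master

# NIS: show the bound NIS server and confirm user accounts are visible.
ypwhich
ypcat passwd | head

# SLURM: list all worker nodes and their current state.
sinfo -Nl

# Monitoring: node_exporter exposes metrics on port 9100 by default.
curl -s http://red1:9100/metrics | head
```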

Performance Evaluation

To assess performance and scalability, the cluster runs the NAS Parallel Benchmarks (NPB) suite from NASA.

These benchmarks measure:

  • Computation throughput — overall processing speed across CPU cores.
  • Inter-node communication performance — network latency and bandwidth effects on distributed workloads (MPI).
  • Scalability and parallel efficiency — how execution time and speedup change as we increase nodes and cores.
  • Resource and thermal behavior — effects such as throttling or memory limits impacting sustained performance.

Detailed results, plots, and interpretation are available in the Benchmark Results Documentation.
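
For reference, a single NPB-MPI kernel can be built and submitted roughly as sketched below, assuming the NPB 3.4 MPI sources live in the NFS-shared home directory (in NPB 3.4 the rank count is chosen at run time rather than at build time). The directory name, problem class, and node/rank counts are illustrative and not the settings behind the published results.

```bash
# Build the Integer Sort (IS) kernel at class C from the NPB MPI suite.
cd ~/NPB3.4-MPI
make is CLASS=C

# Submit it through SLURM: 8 workers x 4 cores = 32 MPI ranks (illustrative).
cat > npb_is.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=npb-is
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
srun ./bin/is.C.x   # relies on SLURM's PMI/PMIx MPI integration; mpirun also works
EOF
sbatch npb_is.sbatch
```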


Automation

Most of the cluster setup and maintenance workflow is orchestrated through Ansible, ensuring consistent configuration and minimal manual intervention. Using a centralized inventory, the master node can simultaneously deploy updates, install packages, and modify configurations across all 16 worker nodes.

Automation covers almost every major subsystem of the cluster, including:

  • NFS & NIS Configuration: Seamless setup of shared storage and centralized authentication.
  • SLURM Deployment: Automatic installation of SLURM controller and compute daemons with Munge authentication.
  • Monitoring Stack: Installation and configuration of Prometheus, Node Exporters, and Grafana dashboards.

All playbooks, inventory files, and configuration templates are documented in the corresponding folders.
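
As a minimal illustration of that workflow, the master node could group the workers in an INI inventory and drive them with an ad-hoc check followed by a playbook run, as sketched below. The file names (inventory.ini, site.yml) are placeholders, not necessarily the names used in this repository.

```bash
# Hypothetical inventory grouping the 16 workers (file name is a placeholder).
cat > inventory.ini <<'EOF'
[red]
red[1:8]

[blue]
blue[1:8]

[workers:children]
red
blue
EOF

# Ad-hoc check that every worker answers over SSH.
ansible -i inventory.ini workers -m ping

# Apply the full configuration in one pass (playbook name is a placeholder).
ansible-playbook -i inventory.ini site.yml
```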


Troubleshooting

All known issues, diagnostic steps, and recovery procedures are centralized in a dedicated guide covering PXE, NFS, NIS, SLURM, and monitoring components.

Read the Troubleshooting Guide


Contributors


Acknowledgements

This project was developed by students of ECE, NTUA (2024–2025) within the framework of Parallel Processing Systems and HPC research.

Special thanks to:

  • CSLab, NTUA, for technical guidance
  • Raspberry Pi Foundation for the open hardware platform
  • The open-source communities of SLURM, Ansible, and Prometheus

References

You may find these links useful:

  • Git Repository of a Similar Project
  • Article for Slurm
  • Another Article for Slurm
  • PDF of HPC Cluster Documentation
  • PDF Slides (Probably not Useful)
  • Article for NFS
  • Article for the Cluster Setup
  • Another Article for Cluster
  • Slurm Documentation
  • Useful YouTube Video for Slurm
  • Slurm Installation Repository Tutorial
  • Slurm Configuration Generator Tool
