HPC_Cluster-ECE_NTUA_2024-2025



Table of Contents

  1. Overview
  2. System Architecture
  3. Network Infrastructure
  4. Cluster Components
  5. Performance Evaluation
  6. Automation
  7. Troubleshooting
  8. Contributors
  9. Acknowledgements
  10. References

Overview

Pi_Cluster-ECE_NTUA_2024-2025 documents the full setup, configuration, and automation of a High Performance Computing (HPC) cluster built using Raspberry Pi boards by students of the School of Electrical and Computer Engineering, NTUA (2024–2025).

The goal is to create a scalable, educational, and fully functional HPC environment using low-cost hardware.
This environment enables research and learning in parallel computing, distributed systems, and cluster automation.

Key Capabilities

  • Centralized user and resource management (NIS + NFS)
  • Automated provisioning via Ansible
  • Stateless PXE network boot for worker nodes
  • Job scheduling with SLURM and MPI integration
  • Real-time performance monitoring with Prometheus & Grafana
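
For example, a typical user session on such a cluster, compiling an MPI program in the NFS-shared home directory and launching it through SLURM, could look like the sketch below. This is a minimal illustration assuming Open MPI and a working SLURM PMI/PMIx integration; the file name and the node/rank counts are arbitrary placeholders.

```bash
# Toy MPI program written into the NFS-shared home directory (placeholder name).
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF

# Compile once; the binary is visible on every worker through NFS.
mpicc -O2 -o hello_mpi hello_mpi.c

# Launch 16 ranks across 4 workers through the scheduler.
srun --nodes=4 --ntasks=16 ./hello_mpi
```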

System Architecture

The cluster consists of 18 Raspberry Pi 4 units:

  • 1 Login Node – User entry point, OS image management
  • 1 Master Node – Orchestration, NFS, NIS, SLURM, Monitoring
  • 16 Worker Nodes – Compute units divided into two groups:
    • red1–red8
    • blue1–blue8

Operating System: Ubuntu 24.04
Networking: Static IP configuration over Gigabit Ethernet
Boot Method: PXE (diskless network boot)
Storage: NFS shared filesystem
User Management: Centralized with NIS
Job Scheduling: SLURM + Munge + MariaDB + MPI
Monitoring: Prometheus + Grafana + Node Exporter
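
To make the red/blue grouping concrete, the sketch below shows how the worker nodes could be declared as SLURM nodes and partitions on the master. The hostnames and the 4-core CPU count follow the architecture above; the configuration path, memory value, and partition layout are illustrative assumptions rather than the cluster's actual slurm.conf.

```bash
# Hypothetical slurm.conf excerpt (path may differ per installation).
sudo tee -a /etc/slurm/slurm.conf >/dev/null <<'EOF'
NodeName=red[1-8]  CPUs=4 RealMemory=3800 State=UNKNOWN
NodeName=blue[1-8] CPUs=4 RealMemory=3800 State=UNKNOWN
PartitionName=red  Nodes=red[1-8]           MaxTime=INFINITE State=UP
PartitionName=blue Nodes=blue[1-8]          MaxTime=INFINITE State=UP
PartitionName=all  Nodes=red[1-8],blue[1-8] Default=YES MaxTime=INFINITE State=UP
EOF

# Ask the controller to re-read its configuration.
sudo scontrol reconfigure
```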

Diagram

[Figure: HPC cluster architecture]


Network Infrastructure

The nodes are interconnected through a Ubiquiti Managed Layer 2 Switch, forming the high-speed backbone of the cluster.

Switch specifications:

  • Layer: L2 Managed
  • Switching speed: 88 Gbps
  • Ports: 48 × 1 Gbps Ethernet + 2 × SFP
  • PoE support: 802.3af / 802.3at (PoE+)
  • Rack mountable: Yes
  • Purpose: Provides Gigabit connectivity and PoE power for all nodes

This switch ensures stable communication, traffic prioritization, and remote management across the entire HPC network.
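
Since the nodes use static addressing on Ubuntu 24.04, per-node addresses are typically expressed as netplan configuration; on the PXE-booted workers this would live inside the shared boot image or be handled by DHCP reservations in the PXE setup. The snippet below is only a sketch for a single node: the interface name, subnet, gateway, and DNS server are placeholder assumptions, not the cluster's actual addressing plan.

```bash
# Hypothetical static-IP netplan file for one node (addresses are placeholders).
sudo tee /etc/netplan/01-cluster.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [10.0.0.11/24]   # e.g. red1; each node gets a unique address
      routes:
        - to: default
          via: 10.0.0.1           # gateway (assumption)
      nameservers:
        addresses: [10.0.0.1]
EOF
sudo netplan apply
```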


Cluster Components

Each subsystem is modular and documented in a dedicated folder.

  • PXE (Netboot): enables diskless network boot for all worker nodes (see PXE Setup)
  • NFS: provides the shared filesystem for home directories and datasets (see NFS Setup)
  • NIS: centralized user authentication and identity service (see NIS Setup)
  • SLURM: resource and job scheduler integrating MPI and Munge (see SLURM Setup)
  • Monitoring: Prometheus + Grafana + Node Exporter metrics stack (see Monitoring Setup)
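
Once these pieces are in place, each subsystem can be sanity-checked with a few standard client commands. The sketch below assumes the master node is reachable as "master" and that node_exporter runs on its default port 9100; the hostnames are placeholders.

```bash
# NFS: list the filesystems exported by the master node.
showmount -e master

# NIS: show the bound NIS server and confirm user accounts are visible.
ypwhich
ypcat passwd | head

# SLURM: list all worker nodes and their current state.
sinfo -Nl

# Monitoring: node_exporter exposes metrics on port 9100 by default.
curl -s http://red1:9100/metrics | head
```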

Performance Evaluation

To assess performance and scalability, the cluster runs the NAS Parallel Benchmarks (NPB) suite from NASA.

These benchmarks measure:

  • Computation throughput — overall processing speed across CPU cores.
  • Inter-node communication performance — network latency and bandwidth effects on distributed workloads (MPI).
  • Scalability and parallel efficiency — how execution time and speedup change as we increase nodes and cores.
  • Resource and thermal behavior — effects such as throttling or memory limits impacting sustained performance.

Detailed results, plots, and interpretation are available in the Benchmark Results Documentation.
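
For reference, a single NPB-MPI kernel can be built and submitted roughly as sketched below, assuming the NPB 3.4 MPI sources live in the NFS-shared home directory (in NPB 3.4 the rank count is chosen at run time rather than at build time). The directory name, problem class, and node/rank counts are illustrative and not the settings behind the published results.

```bash
# Build the Integer Sort (IS) kernel at class C from the NPB MPI suite.
cd ~/NPB3.4-MPI
make is CLASS=C

# Submit it through SLURM: 8 workers x 4 cores = 32 MPI ranks (illustrative).
cat > npb_is.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=npb-is
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
srun ./bin/is.C.x   # relies on SLURM's PMI/PMIx MPI integration; mpirun also works
EOF
sbatch npb_is.sbatch
```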


Automation

Most of the cluster setup and maintenance workflow is orchestrated through Ansible, ensuring consistent configuration and minimal manual intervention. Using a centralized inventory, the master node can simultaneously deploy updates, install packages, and modify configurations across all 16 worker nodes.

Automation covers almost every major subsystem of the cluster, including:

  • NFS & NIS Configuration: Seamless setup of shared storage and centralized authentication.
  • SLURM Deployment: Automatic installation of SLURM controller and compute daemons with Munge authentication.
  • Monitoring Stack: Installation and configuration of Prometheus, Node Exporters, and Grafana dashboards.

All playbooks, inventory files, and configuration templates are documented in the corresponding folders.
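
As a minimal illustration of that workflow, the master node could group the workers in an INI inventory and drive them with an ad-hoc check followed by a playbook run, as sketched below. The file names (inventory.ini, site.yml) are placeholders, not necessarily the names used in this repository.

```bash
# Hypothetical inventory grouping the 16 workers (file name is a placeholder).
cat > inventory.ini <<'EOF'
[red]
red[1:8]

[blue]
blue[1:8]

[workers:children]
red
blue
EOF

# Ad-hoc check that every worker answers over SSH.
ansible -i inventory.ini workers -m ping

# Apply the full configuration in one pass (playbook name is a placeholder).
ansible-playbook -i inventory.ini site.yml
```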


Troubleshooting

All known issues, diagnostic steps, and recovery procedures are centralized in a dedicated guide covering PXE, NFS, NIS, SLURM, and monitoring components.

Read the Troubleshooting Guide


Contributors


Acknowledgements

This project was developed by students of ECE, NTUA (2024–2025) within the framework of Parallel Processing Systems and HPC research.

Special thanks to:

  • CSLab, NTUA, for technical guidance
  • Raspberry Pi Foundation for the open hardware platform
  • The open-source communities of SLURM, Ansible, and Prometheus

References

You may find these links useful:

  • Git Repository of a Similar Project
  • Article for Slurm
  • Another Article for Slurm
  • PDF of HPC Cluster Documentation
  • PDF Slides (Probably not Useful)
  • Article for NFS
  • Article for the Cluster Setup
  • Another Article for Cluster
  • Slurm Documentation
  • Useful YouTube Video for Slurm
  • Slurm Installation Repository Tutorial
  • Slurm Configuration Generator Tool
