This repository contains scripts and source code to build and run vLLM inside an Apptainer container on an HPC system with SLURM.
- jobs/: SLURM job scripts to build the container and run inference.
- src/: Python source code for running vLLM inference.
This repository is the codebase for the tutorial here: https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/232851290/LLM+inference+on+Snellius+with+vLLM
```
vllm-inference-slurm/
├── README.md
├── .gitignore
├── jobs/
│   ├── build_vllm.job
│   └── run_vllm_serve.job
└── src/
    └── vllm_serve.py
```
Skip this step if the prebuilt container on Snellius is sufficient; see the tutorial linked above for details.
Build an NGC vLLM container from the NGC catalog (https://catalog.ngc.nvidia.com). Note: these containers do not always run smoothly with every model from HuggingFace; consider using the CUDA container below instead.
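For orientation, the build job presumably wraps an `apptainer build` from an NGC image in a short SLURM script. A minimal sketch is below; the image tag is a placeholder, not necessarily what jobs/build_vllm.job actually uses (check the NGC catalog for current vLLM container tags):

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=16

# Convert the NGC Docker image into an Apptainer SIF file.
# The image tag below is a placeholder; look up the current release on NGC.
apptainer build vllm.sif docker://nvcr.io/nvidia/vllm:25.01-py3
```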
```
sbatch jobs/build_vllm.job
```

Alternatively, use a CUDA container and install torch, vLLM, etc. yourself. While this container generally works, it may not be as optimized for the hardware as the container above.
```
sbatch jobs/build_cuda_vllm_apptainer.job
```

To run inference, submit the serve job:

```
sbatch jobs/run_vllm_serve.job
```

Or run it interactively on an allocated GPU:

```
# salloc a GPU...
chmod +x jobs/run_vllm_serve.job
./jobs/run_vllm_serve.job
```
Specify in jobs/run_vllm_serve.job the environment variables corresponding to your vLLM task. Below is an example configuration that runs the Tower-Plus model on GSM8K with the `gsm8k` template preset provided in src/vllm_serve.py (a `tower` preset for machine translation is also available):

```
MODEL_CHECKPOINT=Unbabel/Tower-Plus-2B
DATASET=openai/gsm8k
TEMPLATE_PRESET=gsm8k
DATA_SPLIT=test[:100]
VLLM_BASE_URL=http://localhost:$PORT/v1
TEMPERATURE=0.7
MAX_TOKENS=256
MAX_CONCURRENT=64
OUTPUT_JSON=predictions.json
```
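Inside the job script, these variables are presumably forwarded to the inference script as command-line flags; the flag names below match the options listed in the next section, but the exact mapping in jobs/run_vllm_serve.job may differ:

```bash
# Hypothetical mapping from the job's environment variables to the script's flags.
python src/vllm_serve.py \
    --model "$MODEL_CHECKPOINT" \
    --dataset "$DATASET" \
    --split "$DATA_SPLIT" \
    --base_url "$VLLM_BASE_URL" \
    --temperature "$TEMPERATURE" \
    --max_tokens "$MAX_TOKENS" \
    --max_concurrent "$MAX_CONCURRENT" \
    --template_preset "$TEMPLATE_PRESET" \
    --output "$OUTPUT_JSON"
```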
These environment variables correspond to the command-line options of src/vllm_serve.py:
```
options:
  -h, --help            show this help message and exit
  --model MODEL         Model name
  --dataset DATASET     HuggingFace dataset name
  --split SPLIT         Dataset split
  --subset SUBSET       Dataset subset
  --base_url BASE_URL   vLLM server URL
  --temperature TEMPERATURE
                        Sampling temperature
  --max_tokens MAX_TOKENS
                        Max tokens to generate
  --max_concurrent MAX_CONCURRENT
                        Max concurrent requests
  --instruction_template INSTRUCTION_TEMPLATE
                        Instruction template with {field_name} placeholders
                        (e.g., 'Solve: {question}')
  --template_preset {gsm8k,alpaca,squad,mmlu,tower,default}
                        Use a preset template
  --output OUTPUT       Output file
```
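Once the server is up, a quick way to check it and to try a custom instruction template (instead of a preset) is sketched below; `{question}` must name a field of the dataset rows:

```bash
# Optional sanity check: the OpenAI-compatible server lists its model here.
curl "http://localhost:$PORT/v1/models"

# Run with a custom instruction template instead of a preset.
python src/vllm_serve.py \
    --model "$MODEL_CHECKPOINT" \
    --dataset openai/gsm8k \
    --split "test[:100]" \
    --base_url "http://localhost:$PORT/v1" \
    --instruction_template 'Solve: {question}' \
    --output predictions.json
```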