The Docker images are based on NVIDIA CUDA images. The LLMs are pre-loaded and served via vLLM.
- `TENSOR_PARALLEL_SIZE`: Number of GPUs to use. Default: `1`.
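As a sketch, a container might be started as follows. The flags are standard Docker options; the choice of image tag and the `TENSOR_PARALLEL_SIZE` value are illustrative assumptions for a two-GPU host.

```shell
# Sketch: run one of the pre-built images on a host with 2 GPUs.
# TENSOR_PARALLEL_SIZE shards the model across the GPUs;
# port 8000 exposes the API served by vLLM.
docker run --gpus all \
  -e TENSOR_PARALLEL_SIZE=2 \
  -p 8000:8000 \
  ivangabriele/llm:lmsys__vicuna-13b-v1.5-16k
```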
An OpenAI-compatible API is exposed on port 8000.
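Once the server is up, it can be queried like the OpenAI completions API. The snippet below is a minimal sketch that builds a request payload; the model name is an assumption derived from the image tag, and the actual HTTP call (commented out) requires a running container.

```python
import json

# Hypothetical endpoint for a container exposing port 8000 locally.
BASE_URL = "http://localhost:8000/v1/completions"

# Payload in the OpenAI completions format; the model name is an
# assumption based on the "lmsys__vicuna-13b-v1.5-16k" image tag.
payload = {
    "model": "lmsys/vicuna-13b-v1.5-16k",
    "prompt": "Hello, world!",
    "max_tokens": 32,
}

# With a running server, the request would be sent like this:
#   import requests
#   response = requests.post(BASE_URL, json=payload).json()
print(json.dumps(payload))
```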
**Note:** The VRAM column shows the minimum amount of VRAM the model requires on a single GPU.
| Tag | Model | RunPod | Vast.ai | VRAM |
|---|---|---|---|---|
| `ivangabriele/llm:lmsys__vicuna-13b-v1.5-16k` | lmsys/vicuna-13b-v1.5-16k | | | 26 GB |
| `ivangabriele/llm:open-orca__llongorca-13b-16k` | Open-Orca/LlongOrca-13B-16k | | | 26 GB |
- Add more popular models.
- Start the server in the background to allow for SSH access.