Deploy a llama.cpp server on fly.io.
Uses minimal dependencies to keep the image small. Downloads model files on initial boot and caches them in a volume for fast subsequent cold starts.
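The caching logic amounts to a check-then-download at boot. Here is a minimal sketch of such an entrypoint, assuming the `MODEL_URL` and `MODEL_FILE` env vars from fly.toml and llama.cpp's `llama-server` binary; the repo's actual script and flags may differ:

```sh
#!/bin/sh
# Sketch only, not necessarily the repo's exact entrypoint.
set -e
# /models is the mounted volume; download only if the model isn't cached yet,
# so subsequent cold starts skip straight to serving.
if [ ! -f "/models/$MODEL_FILE" ]; then
  wget -O "/models/$MODEL_FILE" "$MODEL_URL"
fi
# Binary name assumed from llama.cpp's bundled server; --api-key matches the
# API_KEY secret set during deployment.
exec llama-server --host 0.0.0.0 --port 8080 \
  --model "/models/$MODEL_FILE" --api-key "$API_KEY"
```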
```sh
fly launch --no-deploy
fly vol create models -s 10 --vm-gpu-kind a10 --region ord
fly secrets set API_KEY=<your-api-key>
fly deploy
```

The provided Dockerfile is configured to use the a10 GPU kind. To use a different GPU:
- Update the `CUDA_DOCKER_ARCH` variable in the build step to an appropriate value for the desired GPU. Arch values map to CUDA compute capabilities, which are listed in NVIDIA's CUDA documentation; e.g. put `CUDA_DOCKER_ARCH=compute_86` for compute capability 8.6.
- Update the `--vm-gpu-kind` flag in the `fly vol create` command to the desired GPU kind; e.g. put `--vm-gpu-kind a100` for an A100 GPU.
- Update `vm.gpu_kind` in the fly.toml file to the desired GPU kind; e.g. put `gpu_kind = "a100"` for an A100 GPU. (A combined sketch of all three changes follows this list.)
This example uses the phi-3-mini-4k-instruct model by default. To use a different model:
- Update the `MODEL_URL` and `MODEL_FILE` env variables in the fly.toml file to your desired model. The file will be downloaded as `/models/$MODEL_FILE` on the next deploy.
- To delete any existing model files, use `fly ssh console` to connect to your machine and run `rm /models/<model_file>`. (A combined example follows this list.)
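Put together, a model switch might look like the following; the fly.toml `[env]` layout and the Hugging Face URL shape are assumptions, and `<model_file>` stays a placeholder:

```sh
# In fly.toml, point the env vars at the new model ([env] layout assumed):
#   [env]
#     MODEL_URL  = "https://huggingface.co/<repo>/resolve/main/<model_file>"
#     MODEL_FILE = "<model_file>"

# Clear the old cached file so the volume does not fill up, then redeploy:
fly ssh console -C "rm /models/<model_file>"
fly deploy
```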
This example sets the `--api-key` flag on the server start command to guard against unauthorized access. To set the API key:

```sh
fly secrets set API_KEY=<your-api-key>
```

The app will use the new API key on the next deploy.
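To confirm the key is enforced, you can call the server's OpenAI-compatible chat endpoint, which llama.cpp's server exposes and guards with a Bearer token when `--api-key` is set (`<app-name>` is your Fly app's name, assuming the default public HTTPS service):

```sh
curl https://<app-name>.fly.dev/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello."}]}'
```

A request without the correct key should be rejected with a 401.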