
Conversation

@ggerganov
Member

fix #17037
cont #16812

The small SWA caches can be padded to a multiple of 256 with negligible memory overhead. This is friendly to the CUDA backend, since its FA implementation benefits from round K/V tensor sizes.
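
For illustration, a minimal sketch of the padding idea (not the actual diff from this PR; the function name and the example window/microbatch sizes are hypothetical, only the 256 padding constant comes from the description above):

```cpp
#include <cstdint>
#include <cstdio>

// Round a KV cache size up to the next multiple of `pad`.
// Same intent as ggml's GGML_PAD macro; the function name here is illustrative.
static uint32_t pad_kv_size(uint32_t n_cells, uint32_t pad) {
    return ((n_cells + pad - 1) / pad) * pad;
}

int main() {
    // Hypothetical SWA cache size: sliding window plus one microbatch worth of cells.
    const uint32_t n_swa    = 128;
    const uint32_t n_ubatch = 512;
    const uint32_t padded   = pad_kv_size(n_swa + n_ubatch, 256);

    // The SWA cache is small, so rounding up to a multiple of 256 costs little memory,
    // while the CUDA FA kernels prefer round K/V sizes.
    printf("SWA KV cache: %u -> %u cells\n", n_swa + n_ubatch, padded);
    return 0;
}
```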

GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b 5d884e6a6 llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 0 -mmp 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| Model | Microbatch size | Test | t/s a8ca18b | t/s 5d884e6 | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d4096 | 9696.84 | 9759.83 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d8192 | 8654.61 | 8583.57 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d16384 | 7193.83 | 7056.71 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d32768 | 5358.46 | 5258.01 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d4096 | 8676.29 | 8684.45 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d8192 | 7926.21 | 7804.37 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d16384 | 6665.34 | 6600.58 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d32768 | 5054.64 | 5032.95 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d4096 | 17220.95 | 17256.36 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d8192 | 14029.22 | 14099.89 | 1.01 |
| qwen2 3B Q8_0 | 512 | pp512@d16384 | 10104.63 | 10310.87 | 1.02 |
| qwen2 3B Q8_0 | 512 | pp512@d32768 | 6621.30 | 6614.64 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d4096 | 16843.04 | 16740.16 | 0.99 |
| qwen2 3B Q8_0 | 4096 | pp512@d8192 | 13618.84 | 13626.10 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d16384 | 10086.70 | 10084.63 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d32768 | 6591.20 | 6638.43 | 1.01 |

