
Conversation

@ggerganov
Member

fix #17037
cont #16812

The small SWA caches can be padded to a multiple of 256 with negligible memory overhead. This is friendly to the CUDA backend, since its FA implementation benefits from round K/V tensor sizes.
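
For illustration, a minimal sketch of the padding idea (not the actual diff from this PR; the function name and the example window/microbatch sizes are hypothetical, only the 256 padding constant comes from the description above):

```cpp
#include <cstdint>
#include <cstdio>

// Round a KV cache size up to the next multiple of `pad`.
// Same intent as ggml's GGML_PAD macro; the function name here is illustrative.
static uint32_t pad_kv_size(uint32_t n_cells, uint32_t pad) {
    return ((n_cells + pad - 1) / pad) * pad;
}

int main() {
    // Hypothetical SWA cache size: sliding window plus one microbatch worth of cells.
    const uint32_t n_swa    = 128;
    const uint32_t n_ubatch = 512;
    const uint32_t padded   = pad_kv_size(n_swa + n_ubatch, 256);

    // The SWA cache is small, so rounding up to a multiple of 256 costs little memory,
    // while the CUDA FA kernels prefer round K/V sizes.
    printf("SWA KV cache: %u -> %u cells\n", n_swa + n_ubatch, padded);
    return 0;
}
```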

GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b 5d884e6a6 llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 0 -mmp 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| Model | Microbatch size | Test | t/s a8ca18b | t/s 5d884e6 | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d4096 | 9696.84 | 9759.83 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d8192 | 8654.61 | 8583.57 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d16384 | 7193.83 | 7056.71 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d32768 | 5358.46 | 5258.01 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d4096 | 8676.29 | 8684.45 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d8192 | 7926.21 | 7804.37 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d16384 | 6665.34 | 6600.58 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d32768 | 5054.64 | 5032.95 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d4096 | 17220.95 | 17256.36 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d8192 | 14029.22 | 14099.89 | 1.01 |
| qwen2 3B Q8_0 | 512 | pp512@d16384 | 10104.63 | 10310.87 | 1.02 |
| qwen2 3B Q8_0 | 512 | pp512@d32768 | 6621.30 | 6614.64 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d4096 | 16843.04 | 16740.16 | 0.99 |
| qwen2 3B Q8_0 | 4096 | pp512@d8192 | 13618.84 | 13626.10 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d16384 | 10086.70 | 10084.63 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d32768 | 6591.20 | 6638.43 | 1.01 |

