Performance regression in prompt processing #17037

@justmyselfyouknow

I pulled master today and noticed that my prompt processing (pp) throughput has gotten worse.

I bisected the regression to commit 85a7d86 (#16812).

About my system:

96 GB DDR5-6000
PCIe 5.0 x16
RTX 5090
24/36 layers offloaded to the CPU

The commit before the breaking commit:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 | 4095.21 ± 14.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d4096 | 3858.68 ± 45.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d8192 | 3682.09 ± 11.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d16384 | 3338.96 ± 19.24 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d20000 | 3219.27 ± 13.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d32768 | 2818.29 ± 24.60 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d65536 | 2125.18 ± 18.36 |

build: a8ca18b (6867)

The breaking commit:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 | 4078.31 ± 17.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d4096 | 3493.92 ± 17.48 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d8192 | 3299.58 ± 71.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d16384 | 2981.73 ± 121.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d20000 | 1950.21 ± 14.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d32768 | 2594.72 ± 41.55 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp4096 @ d65536 | 2004.36 ± 9.00 |

build: 85a7d86 (6868)

Notice that depth 0 is essentially unchanged (about 0.4% slower, within the reported noise), while every non-zero depth is roughly 6-11% slower, except for depth 20000, which sees a massive drop of about 39%.
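For anyone who wants to check the per-depth change, here is a minimal Python sketch that compares the mean t/s values from the two tables above. The numbers are copied by hand from the llama-bench output, and the helper name `pct_change` is only illustrative; it ignores the ± spread, so treat small differences with caution.

```python
# Mean t/s per depth, copied from the two llama-bench tables in this report.
# Key 0 is the plain pp4096 run; the other keys are the "@ dN" depths.
before = {  # build a8ca18b (6867)
    0: 4095.21, 4096: 3858.68, 8192: 3682.09, 16384: 3338.96,
    20000: 3219.27, 32768: 2818.29, 65536: 2125.18,
}
after = {  # build 85a7d86 (6868)
    0: 4078.31, 4096: 3493.92, 8192: 3299.58, 16384: 2981.73,
    20000: 1950.21, 32768: 2594.72, 65536: 2004.36,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means slower)."""
    return (new - old) / old * 100.0


changes = {depth: pct_change(before[depth], after[depth]) for depth in before}

for depth, change in changes.items():
    print(f"d{depth:<6} {before[depth]:8.2f} -> {after[depth]:8.2f} t/s  {change:+6.1f}%")

# The depth with the largest slowdown.
worst_depth = min(changes, key=changes.get)
print(f"largest drop at depth {worst_depth}: {changes[worst_depth]:.1f}%")
```

The depth 20000 row stands out because it is the only depth tested that is not a power of two.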

Please let me know if you need any more information. Thanks!

Labels: bug
