Performance regression in prompt processing

I pulled master today and noticed that my prompt processing (pp) speed has gotten worse.

I bisected the regression to commit 85a7d8677bf2200981e52f744a21d5267964ffcf (#16812).

About my system:

96GB DDR5-6000
PCIe 5.0 x16
RTX 5090
24/36 layers offloaded to the CPU

The commit before the breaking commit:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |          pp4096 |      4095.21 ± 14.55 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |  pp4096 @ d4096 |      3858.68 ± 45.20 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |  pp4096 @ d8192 |      3682.09 ± 11.12 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d16384 |      3338.96 ± 19.24 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d20000 |      3219.27 ± 13.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d32768 |      2818.29 ± 24.60 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d65536 |      2125.18 ± 18.36 |

build: a8ca18b4b (6867)



The breaking commit:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |          pp4096 |      4078.31 ± 17.44 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |  pp4096 @ d4096 |      3493.92 ± 17.48 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 |  pp4096 @ d8192 |      3299.58 ± 71.51 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d16384 |     2981.73 ± 121.58 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d20000 |      1950.21 ± 14.90 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d32768 |      2594.72 ± 41.55 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       | 999 |    4096 |     4096 |  1 |    0 | pp4096 @ d65536 |       2004.36 ± 9.00 |

build: 85a7d8677 (6868)


Notice that performance is essentially unchanged at depth 0 (within noise) and roughly 6-11% worse at every other depth, except for depth 20000, which sees a massive drop of about 40% (3219.27 to 1950.21 t/s).
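For reference, here is a minimal sketch of how the per-depth drop can be computed from the two tables above. The t/s means are copied by hand from the llama-bench output; the helper name `pct_drop` and the dictionary layout are just illustrative choices, not part of any llama.cpp tooling.

```python
# Mean prompt-processing throughput (t/s) per depth, copied from the tables above.
# Key 0 stands for the plain pp4096 test (no depth).
before = {  # build a8ca18b4b (6867)
    0: 4095.21, 4096: 3858.68, 8192: 3682.09, 16384: 3338.96,
    20000: 3219.27, 32768: 2818.29, 65536: 2125.18,
}
after = {  # build 85a7d8677 (6868)
    0: 4078.31, 4096: 3493.92, 8192: 3299.58, 16384: 2981.73,
    20000: 1950.21, 32768: 2594.72, 65536: 2004.36,
}


def pct_drop(old: float, new: float) -> float:
    """Relative throughput loss in percent (positive means slower)."""
    return (old - new) / old * 100.0


# Per-depth regression, in the same order as the tables.
drops = {depth: pct_drop(before[depth], after[depth]) for depth in before}

for depth, drop in drops.items():
    print(f"d{depth:<6} {before[depth]:8.2f} -> {after[depth]:8.2f} t/s  ({drop:5.1f}% drop)")
```

This only compares the reported means; it ignores the ± spread, so the small difference at depth 0 should not be read as a real regression.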

Please let me know if you need any more information. Thanks!




Performance regression in prompt processing #17037
