
Conversation

chaxu01 (Collaborator) commented Nov 4, 2025

Benchmarks from MacBook M4:

With KleidiAI

GGML_KLEIDIAI_SME=1 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        504.01 ± 2.70 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         93.68 ± 0.16 |

GGML_KLEIDIAI_SME=0 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        193.94 ± 1.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         43.45 ± 0.34 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        692.11 ± 0.71 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |       132.24 ± 16.44 |

Without KleidiAI

./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |         44.39 ± 0.52 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         41.61 ± 0.25 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        156.83 ± 0.62 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |        115.41 ± 1.82 |
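As a side note (not part of the PR itself), the tables above imply the following single-thread speedups; a quick script to derive them from the reported t/s figures:

```python
# Back-of-envelope speedups from the llama-bench tables above (1 thread, t/s).
# Numbers copied verbatim from the benchmark output in this PR.
sme = {"pp512": 504.01, "tg128": 93.68}       # GGML_KLEIDIAI_SME=1
kleidiai = {"pp512": 193.94, "tg128": 43.45}  # GGML_KLEIDIAI_SME=0
baseline = {"pp512": 44.39, "tg128": 41.61}   # built without KleidiAI

for test in ("pp512", "tg128"):
    print(f"{test}: SME vs non-SME KleidiAI {sme[test] / kleidiai[test]:.2f}x, "
          f"SME vs plain CPU {sme[test] / baseline[test]:.2f}x")
```

Roughly a 2.6x prompt-processing gain from the SME micro-kernels over the non-SME KleidiAI path, and over 11x versus the plain CPU backend at one thread.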

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 4, 2025
chaxu01 (Collaborator, Author) commented Nov 6, 2025

Hi @ggerganov, this PR adds Q8_0 optimization kernels for the KleidiAI backend.
The CI shows three failed cases, but they appear to be unrelated (KleidiAI isn’t enabled in those jobs).
Please take a look when you have a moment, thanks!
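For anyone wanting to reproduce the numbers above: assuming the standard llama.cpp CMake workflow, a build with the KleidiAI backend enabled would look roughly like the following (the `GGML_CPU_KLEIDIAI` flag name is taken from the llama.cpp build docs; verify it against your checkout, as this sketch is not part of the PR):

```shell
# Sketch: build llama.cpp with the KleidiAI CPU backend enabled.
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j

# Run the same benchmark as above, toggling the SME micro-kernels via env var.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-bench \
    -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
```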

ggerganov (Member) commented:

@chaxu01 Shall we first merge the CI runner (#17021) and then this PR?
