
Conversation

chaxu01 (Collaborator) commented Nov 4, 2025

Benchmarks from MacBook M4:

With KleidiAI

GGML_KLEIDIAI_SME=1 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        504.01 ± 2.70 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         93.68 ± 0.16 |

GGML_KLEIDIAI_SME=0 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        193.94 ± 1.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         43.45 ± 0.34 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        692.11 ± 0.71 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |       132.24 ± 16.44 |

Without KleidiAI

./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |         44.39 ± 0.52 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         41.61 ± 0.25 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        156.83 ± 0.62 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |        115.41 ± 1.82 |
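As a side note (not part of the PR itself), the tables above imply the following single-thread speedups; a quick script to derive them from the reported t/s figures:

```python
# Back-of-envelope speedups from the llama-bench tables above (1 thread, t/s).
# Numbers copied verbatim from the benchmark output in this PR.
sme = {"pp512": 504.01, "tg128": 93.68}       # GGML_KLEIDIAI_SME=1
kleidiai = {"pp512": 193.94, "tg128": 43.45}  # GGML_KLEIDIAI_SME=0
baseline = {"pp512": 44.39, "tg128": 41.61}   # built without KleidiAI

for test in ("pp512", "tg128"):
    print(f"{test}: SME vs non-SME KleidiAI {sme[test] / kleidiai[test]:.2f}x, "
          f"SME vs plain CPU {sme[test] / baseline[test]:.2f}x")
```

Roughly a 2.6x prompt-processing gain from the SME micro-kernels over the non-SME KleidiAI path, and over 11x versus the plain CPU backend at one thread.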

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 4, 2025
chaxu01 (Collaborator, Author) commented Nov 6, 2025

Hi @ggerganov, this PR adds Q8_0 optimization kernels for the KleidiAI backend.
The CI shows three failed cases, but they appear to be unrelated (KleidiAI isn’t enabled in those jobs).
Please take a look when you have a moment, thanks!
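For anyone wanting to reproduce the numbers above: assuming the standard llama.cpp CMake workflow, a build with the KleidiAI backend enabled would look roughly like the following (the `GGML_CPU_KLEIDIAI` flag name is taken from the llama.cpp build docs; verify it against your checkout, as this sketch is not part of the PR):

```shell
# Sketch: build llama.cpp with the KleidiAI CPU backend enabled.
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j

# Run the same benchmark as above, toggling the SME micro-kernels via env var.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-bench \
    -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
```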

ggerganov (Member) commented:

@chaxu01 Shall we first merge the CI runner (#17021) and then this PR?
