Conversation

tarruda commented Nov 6, 2025

This is adapted from the Ollama backend, but uses the llama.cpp server. Another difference is that it passes/receives raw tokens to/from llama.cpp.

tarruda force-pushed the llamacpp_server_inference branch from 45f9dbc to 54911df on November 6, 2025 at 12:08

sayap commented Nov 6, 2025

Did some testing, and this works fine together with codex configured with wire_api = "responses" 👍

tarruda commented Nov 6, 2025

Did some testing, and this works fine together with codex configured with wire_api = "responses" 👍

I've also fixed the token window calculation: openai/codex#6316

If you want to see a correct "% context left" value with my codex PR, add model_context_window = 131072 to config.toml (or whatever value you passed to llama-server --ctx-size).
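For reference, here is a minimal config.toml sketch combining the settings mentioned in this thread (the provider id, model name, and base_url port are placeholders, and field names may differ between codex versions):

# codex config.toml sketch (illustrative; adjust names/ports to your setup)
model_provider = "llamacpp_responses"   # placeholder provider id
model = "gpt-oss-120b"                  # placeholder model name
model_context_window = 131072           # match llama-server --ctx-size

[model_providers.llamacpp_responses]
name = "llama.cpp responses server"     # placeholder display name
base_url = "http://127.0.0.1:8000/v1"   # placeholder URL of the responses server
wire_api = "responses"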

Mushoz commented Nov 9, 2025

Does this correctly pass back the reasoning content to llamacpp? Because gpt-oss-120b is reliant on reasoning content being included in subsequent requests after a toolcall, so that it can do multiple toolcalls in a row without losing the reasoning content.

tarruda commented Nov 10, 2025

Does this correctly pass back the reasoning content to llamacpp? Because gpt-oss-120b is reliant on reasoning content being included in subsequent requests after a toolcall, so that it can do multiple toolcalls in a row without losing the reasoning content.

This implementation uses the llama.cpp server as a raw completion API (the /completion endpoint). Chat/reasoning abstractions are parsed outside of llama.cpp, so the answer is yes. Additionally, this is OpenAI's Responses API implementation and uses the reference harmony implementation, so it should be quite accurate.

One issue is that it doesn't benefit from llama.cpp's constrained output, which enforces that function calls are formatted according to the parameters' JSON schema, but in my testing with codex + gpt-oss it does well and is usable enough.

The only issue I saw was with the apply_patch function, which is not a JSON tool but rather a raw-string-output function that should follow a Lark grammar. With this tool gpt-oss sometimes fails to generate a correctly formatted patch, but it will usually get it right on the second or third attempt.

I'm currently working on an alternative implementation that aims to fix this by using pre-call stop tokens to get control back from llama.cpp before it generates a tool call, so that it can pass a proper grammar and enforce the correct output format.
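To make the idea concrete, here is a rough Python sketch of that two-phase flow against llama.cpp server's /completion endpoint. The stop marker, the TOOL_CALL_GBNF grammar string, and the logic for deciding that a tool call is actually starting are simplified assumptions, not the eventual implementation:

import requests

LLAMA_SERVER_URL = "http://127.0.0.1:8080"  # assumption: local llama-server
TOOL_CALL_GBNF = "..."  # hypothetical GBNF grammar for the tool-call arguments

def generate_with_constrained_tool_call(prompt: str) -> str:
    # Phase 1: generate freely, but stop right before the tool-call payload.
    # "<|message|>" is used as an illustrative pre-call stop marker; a real
    # implementation must distinguish tool calls from final messages here.
    first = requests.post(f"{LLAMA_SERVER_URL}/completion", json={
        "prompt": prompt,
        "stop": ["<|message|>"],
        "cache_prompt": True,
    }).json()

    # Phase 2: resume from the same prefix with a grammar attached, so the
    # argument payload is forced into the expected format.
    prefix = prompt + first["content"] + "<|message|>"
    second = requests.post(f"{LLAMA_SERVER_URL}/completion", json={
        "prompt": prefix,
        "grammar": TOOL_CALL_GBNF,
        "cache_prompt": True,
    }).json()
    return first["content"] + "<|message|>" + second["content"]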

Mushoz commented Nov 10, 2025

When using codex with the standard chat completions endpoint of llamacpp, gpt-oss-120b really struggles with editing import statements in .php files. Those import statements include backslashes, and gpt-oss keeps escaping them, which then causes the search/replace block in apply_patch to fail.

Is that the same grammar issue that you are talking about? Because if so, it's amazing to hear you identified the root cause and want to fix it!

tarruda commented Nov 10, 2025

Is that the same grammar issue that you are talking about? Because if so, it's amazing to hear you identified the root cause and want to fix it!

I don't think it is the same issue.

AFAICT the problem is that in the chat completions protocol there's no standard way of sending/receiving reasoning content, so each provider has its own solution.

llama.cpp uses reasoning_content as the key to store reasoning in the message objects, but this key is not parsed by codex in the chat completions provider (it looks for reasoning instead of reasoning_content), so it fails to send reasoning back to llama.cpp.

If you want to use llama.cpp + gpt-oss + codex with the chat completions API, then you need to use a codex fork which adapts it to llama.cpp's reasoning_content key. Here are the changes you need to make in case you want to compile it locally.
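To make the key mismatch concrete, here is a hypothetical Python shim (purely illustrative; this is not the fork's change set, and exact message shapes vary between versions) that renames the field in both directions:

# Hypothetical adapter-side shim for the reasoning key mismatch described above.
# codex reads/writes reasoning under "reasoning"; llama.cpp's chat completions
# API uses "reasoning_content" on assistant messages.

def codex_to_llamacpp(messages: list[dict]) -> list[dict]:
    """On requests going to llama.cpp: 'reasoning' -> 'reasoning_content'."""
    out = []
    for msg in messages:
        msg = dict(msg)
        if msg.get("role") == "assistant" and "reasoning" in msg:
            msg["reasoning_content"] = msg.pop("reasoning")
        out.append(msg)
    return out

def llamacpp_to_codex(message: dict) -> dict:
    """On responses coming back from llama.cpp: 'reasoning_content' -> 'reasoning'."""
    message = dict(message)
    if "reasoning_content" in message:
        message["reasoning"] = message.pop("reasoning_content")
    return message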

I tried to fix this in codex by making the reasoning key configurable, but the PR was rejected: openai/codex#6136.

Note that while llama.cpp chat completions plus a fixed codex can work with GPT-OSS, it will still sometimes fail to generate patches because apply_patch is a free-form tool and not a JSON-schema-constrained tool.

In my alternative implementation I plan to make a responses API inference server (backed by llama.cpp /completions) that will fully support constrained tool calls, including apply_patch, so I hope the experience of using it with codex will improve.

tarruda commented Nov 10, 2025

Adding @aldehir, who might be interested in this discussion.

Mushoz commented Nov 10, 2025

It's unfortunate that it seems to be a different issue. I was hoping I'd finally found someone who understands what the issue is, because it's driving me crazy. It's very easy to reproduce. Make an empty repo with a User.php file:

<?php

use Symfony\Component\HttpFoundation\Request;
use Ramsey\Uuid\Uuid;
use PHPUnit\Framework\TestCase;
use PHPMailer\PHPMailer\PHPMailer;
use Monolog\Logger;
use GuzzleHttp\Client;
use Firebase\JWT\JWT;
use Doctrine\ORM\EntityManager;
use Carbon\Carbon;
use Aws\S3\S3Client;

Then give Codex the following prompt:

Please sort the import statements in User.php alphabetically

I can count 9 tool call failures with output in Codex's log like this:

2025-11-10T14:42:05.245080Z  INFO ToolCall: apply_patch {"input":"*** Begin Patch\n*** Update File: user.php\n@@\n <?php\n \n-use Symfony\\\\Component\\\\HttpFoundation\\\\Request;\n-use Ramsey\\\\Uuid\\\\Uuid;\n-use PHPUnit\\\\Framework\\\\TestCase;\n-use PHPMailer\\\\PHPMailer\\\\PHPMailer;\n-use Monolog\\\\Logger;\n-use GuzzleHttp\\\\Client;\n-use Firebase\\\\JWT\\\\JWT;\n-use Doctrine\\\\ORM\\\\EntityManager;\n-use Carbon\\\\Carbon;\n-use Aws\\\\S3\\\\S3Client;\n+use Aws\\\\S3\\\\S3Client;\n+use Carbon\\\\Carbon;\n+use Doctrine\\\\ORM\\\\EntityManager;\n+use Firebase\\\\JWT\\\\JWT;\n+use GuzzleHttp\\\\Client;\n+use Monolog\\\\Logger;\n+use PHPMailer\\\\PHPMailer\\\\PHPMailer;\n+use PHPUnit\\\\Framework\\\\TestCase;\n+use Ramsey\\\\Uuid\\\\Uuid;\n+use Symfony\\\\Component\\\\HttpFoundation\\\\Request;\n*** End Patch"}

It finally just cheats by deleting the entire file and rewriting it to work around the issue:

2025-11-10T14:42:58.072406Z  INFO ToolCall: apply_patch {"input":"*** Begin Patch\n*** Delete File: user.php\n*** Add File: user.php\n+<?php\n+\n+use Aws\\S3\\S3Client;\n+use Carbon\\Carbon;\n+use Doctrine\\ORM\\EntityManager;\n+use Firebase\\JWT\\JWT;\n+use GuzzleHttp\\Client;\n+use Monolog\\Logger;\n+use PHPMailer\\PHPMailer\\PHPMailer;\n+use PHPUnit\\Framework\\TestCase;\n+use Ramsey\\Uuid\\Uuid;\n+use Symfony\\Component\\HttpFoundation\\Request;\n*** End Patch"}

I was able to capture a tool call mid-generation in llamacpp, and it looks like this (it's incomplete since it was captured mid-generation):

`<|channel|>analysis<|message|>We need to sort import statements alphabetically in User.php (actually file is user.php). The imports are multiple 'use' statements. Need to edit file accordingly.\n\nCurrent order:\nuse Symfony\\Component\\HttpFoundation\\Request;\nuse Ramsey\\Uuid\\Uuid;\nuse PHPUnit\\Framework\\TestCase;\nuse PHPMailer\\PHPMailer\\PHPMailer;\nuse Monolog\\Logger;\nuse GuzzleHttp\\Client;\nuse Firebase\\JWT\\JWT;\nuse Doctrine\\ORM\\EntityManager;\nuse Carbon\\Carbon;\nuse Aws\\S3\\S3Client;\n\nAlphabetical order by full import string. Let's sort:\n\nAws\\S3\\S3Client\nCarbon\\Carbon\nDoctrine\\ORM\\EntityManager\nFirebase\\JWT\\JWT\nGuzzleHttp\\Client\nMonolog\\Logger\nPHPMailer\\PHPMailer\\PHPMailer\nPHPUnit\\Framework\\TestCase\nRamsey\\Uuid\\Uuid\nSymfony\\Component\\HttpFoundation\\Request\n\nNeed to ensure proper order. Also need to maintain file structure.\n\nWe'll apply patch to replace the block of use statements with sorted ones.\n\n<|end|><|start|>assistant<|channel|>commentary to=functions.apply_patch <|constrain|>json<|message|>{"input":"*** Begin Patch\\n*** Update File: user.php\\n@@\\n-use Symfony\\\\\\\\Component\\\\\\\\HttpFoundation\\\\\\\\Request;\\n-use Ramsey\\\\\\\\Uuid\\\\\\\\Uuid;\\n-use PHPUnit\\\\\\\\Framework\\\\\\\\TestCase;\\n-use PHPMailer\\\\\\\\PHPMailer\\\\\\\\PHPMailer;\\n-use Monolog\\\\\\\\Logger;\\n-use GuzzleHttp\\\\\\\\Client;\\n-use Firebase\\\\\\\\JWT\\\\\\\\JWT;\\n-use Doctrine\\\\\\\\ORM\\\\\\\\EntityManager;\\n-use Carbon\\\\\\\\Carbon;\\n-use Aws\\\\\\\\S3\\\\\\\\S3Client;\\n+use Aws\\\\\\\\S3\\\\\\\\S3Client;\\n+use Carbon\\\\\\\\Carbon;\\n+use Doctrine\\\\\\\\ORM\\\\\\\\EntityManager;\\n+use Firebase\\\\\\\\JWT\\\\\\\\JWT;\\n+use GuzzleHttp\\\\\\\\`

The most interesting part is this:

commentary to=functions.apply_patch <|constrain|>json<|message|>

Why is it constraining its output to valid JSON if apply_patch expects raw text? Is that what's causing the incorrect output? I have also seen it generate the tool call inside the reasoning before executing it, and there it DID have the right syntax, so the model certainly is capable of it.

Interestingly, on a second run I saw another 4 tool call failures, but it eventually succeeded:

2025-11-10T14:48:53.858151Z  INFO ToolCall: apply_patch {"input":"*** Begin Patch\n*** Update File: user.php\n@@\n-use Symfony\\Component\\HttpFoundation\\Request;\n-use Ramsey\\Uuid\\Uuid;\n-use PHPUnit\\Framework\\TestCase;\n-use PHPMailer\\PHPMailer\\PHPMailer;\n-use Monolog\\Logger;\n-use GuzzleHttp\\Client;\n-use Firebase\\JWT\\JWT;\n-use Doctrine\\ORM\\EntityManager;\n-use Carbon\\Carbon;\n-use Aws\\S3\\S3Client;\n+use Aws\\S3\\S3Client;\n+use Carbon\\Carbon;\n+use Doctrine\\ORM\\EntityManager;\n+use Firebase\\JWT\\JWT;\n+use GuzzleHttp\\Client;\n+use Monolog\\Logger;\n+use PHPMailer\\PHPMailer\\PHPMailer;\n+use PHPUnit\\Framework\\TestCase;\n+use Ramsey\\Uuid\\Uuid;\n+use Symfony\\Component\\HttpFoundation\\Request;\n*** End Patch"}

So the model certainly is capable of it! I just have no idea why it's failing so often, or whether it's a bug in Codex, llamacpp, the grammar/chat template, or the model itself.
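To illustrate why the backslashes multiply (my reading of the logs above, using plain Python JSON handling rather than anything model-specific): the patch body needs single backslashes, exactly one level of JSON encoding is expected for the apply_patch argument, and escaping the backslashes an extra time before that encoding produces a patch whose lines no longer match the file.

import json

# The literal line as it appears in the file (single backslashes):
patch_line = r"use Aws\S3\S3Client;"

# Correct: one level of JSON encoding for the apply_patch argument.
print(json.dumps({"input": patch_line}))
# {"input": "use Aws\\S3\\S3Client;"}

# Incorrect: backslashes escaped once more *before* the JSON encoding.
doubled = patch_line.replace("\\", "\\\\")
print(json.dumps({"input": doubled}))
# {"input": "use Aws\\\\S3\\\\S3Client;"}   <- same pattern as the failing calls

# After decoding, the argument no longer matches the file contents:
print(json.loads(json.dumps({"input": doubled}))["input"])
# use Aws\\S3\\S3Client;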

aldehir commented Nov 10, 2025

@Mushoz, if you look at the request codex sends, does the schema for the tool ask for {"code": ".."} or is it of type string?

Mushoz commented Nov 10, 2025

@aldehir :

        {
            "type": "function",
            "function": {
                "parameters": {
                    "type": "object",
                    "properties": {
                        "input": {
                            "type": "string",
                            "description": "The entire contents of the apply_patch command"
                        }
                    },
                    "required": [
                        "input"
                    ],
                    "additionalProperties": false
                },
                "name": "apply_patch",
                "description": "Use the `apply_patch` tool to edit files.\nYour patch language is a stripped...down, file...oriented diff format designed to be easy to parse and safe to apply. You can think of it as a high...level envelope:\n\n*** Begin Patch\n[ one or more file sections ]\n*** End Patch\n\nWithin that envelope, you get a sequence of file operations.\nYou MUST include a header to specify the action you are taking.\nEach operation starts with one of three headers:\n\n*** Add File: <path> - create a new file. Every following line is a + line (the initial contents).\n*** Delete File: <path> - remove an existing file. Nothing follows.\n*** Update File: <path> - patch an existing file in place (optionally with a rename).\n\nMay be immediately followed by *** Move to: <new path> if you want to rename the file.\nThen one or more ...hunks..., each introduced by @@ (optionally followed by a hunk header).\nWithin a hunk each line starts with:\n\nFor instructions on [context_before] and [context_after]:\n- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change...s [context_after] lines in the second change...s [context_before] lines.\n- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:\n@@ class BaseClass\n[3 lines of pre-context]\n- [old_code]\n+ [new_code]\n[3 lines of post-context]\n\n- If a code block is repeated so many times in a class or function such that even a single `@@` statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:\n\n@@ class BaseClass\n@@ \t def method():\n[3 lines of pre-context]\n- [old_code]\n+ [new_code]\n[3 lines of post-context]\n\nThe full grammar definition is below:\nPatch := Begin { FileOp } End\nBegin := \"*** Begin Patch\" NEWLINE\nEnd := \"*** End Patch\" NEWLINE\nFileOp := AddFile | DeleteFile | UpdateFile\nAddFile := \"*** Add File: \" path NEWLINE { \"+\" line NEWLINE }\nDeleteFile := \"*** Delete File: \" path NEWLINE\nUpdateFile := \"*** Update File: \" path NEWLINE [ MoveTo ] { Hunk }\nMoveTo := \"*** Move to: \" newPath NEWLINE\nHunk := \"@@\" [ header ] NEWLINE { HunkLine } [ \"*** End of File\" NEWLINE ]\nHunkLine := (\" \" | \"-\" | \"+\") text NEWLINE\n\nA full patch can combine several operations:\n\n*** Begin Patch\n*** Add File: hello.txt\n+Hello world\n*** Update File: src/app.py\n*** Move to: src/main.py\n@@ def greet():\n-print(\"Hi\")\n+print(\"Hello, world!\")\n*** Delete File: obsolete.txt\n*** End Patch\n\nIt is important to remember:\n\n- You must include a header with your intended action (Add/Delete/Update)\n- You must prefix new lines with `+` even when creating a new file\n- File references can only be relative, NEVER ABSOLUTE.\n",
                "strict": false
            }
        }

As captured by tcpdump

Mushoz commented Nov 10, 2025

Just to avoid confusion: This is with a regular chat completions endpoint of llamacpp. I am not running this responses endpoint yet.

aldehir commented Nov 10, 2025

Just to avoid confusion: This is with a regular chat completions endpoint of llamacpp. I am not running this responses endpoint yet.

Oh OK, yeah, the reasoning content isn't passed back, like @tarruda mentioned.

Codex returns it in reasoning but llama.cpp expects it in reasoning_content. That is what this PR is intended to address, since the Responses API accounts for that workflow. You can build the fork mentioned above that changes the reasoning key used by codex. If you're using my adapter, you'll want to stop using it if you compile codex to use the actual key.

aldehir commented Nov 10, 2025

I also recommend using mitmproxy in reverse proxy mode with --set stream_large_bodies=1 and --set store_streamed_bodies=true. This will let you capture the messages and see the actual details. A useful field is __verbose.prompt; it shows you whether the reasoning from past tool calls really is added (requires passing --verbose to llama.cpp).
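For reference, a possible invocation looks something like this (ports and the upstream URL are placeholders; check your mitmproxy version for the exact option names):

mitmproxy --mode reverse:http://127.0.0.1:8080 --listen-port 8081 \
  --set stream_large_bodies=1 --set store_streamed_bodies=true

You would then point the client at port 8081 instead of at llama-server directly.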

Mushoz commented Nov 10, 2025

I just set up this responses API and it seems to be working wonderfully for including COT in subsequent tool calls until a final message is emitted! So that's a massive improvement. However, the extremely poor editing performance whenever backslashes are included remains, which effectively makes gpt-oss-120b useless when working on PHP projects.

I know it's kinda off-topic for this PR, as that part is working fine, but are either of you able to reproduce it with the toy example I gave earlier? E.g. an empty repo with a single User.php file with the following contents:

<?php

use Symfony\Component\HttpFoundation\Request;
use Ramsey\Uuid\Uuid;
use PHPUnit\Framework\TestCase;
use PHPMailer\PHPMailer\PHPMailer;
use Monolog\Logger;
use GuzzleHttp\Client;
use Firebase\JWT\JWT;
use Doctrine\ORM\EntityManager;
use Carbon\Carbon;
use Aws\S3\S3Client;

With the prompt:

Please sort the import statements in User.php alphabetically

Curious to see if it's a setup issue on my side, or if this is a widespread issue.

Mushoz commented Nov 10, 2025

Interestingly, this prompt improves the success-rate massively:

Please sort the import statements in User.php alphabetically. Important: You MUST use the apply_patch tool. Try to match the text through a raw string. Do NOT escape the backslashes. Do NOT try to write valid JSON. Just plain old string matching.

aldehir commented Nov 10, 2025

Interestingly, this prompt improves the success-rate massively:

Please sort the import statements in User.php alphabetically. Important: You MUST use the apply_patch tool. Try to match the text through a raw string. Do NOT escape the backslashes. Do NOT try to write valid JSON. Just plain old string matching.

Interesting. I'll look into it as backslashes are a bit of a problem with llama.cpp and the way it streams JSON. Might be hitting a bug there.

Mushoz commented Nov 10, 2025

Another thing I noticed while testing this PR is that the following in my llama-server command does not work anymore:

--chat-template-kwargs '{"reasoning_effort":"high"}'

I am guessing that's because Codex is passing a reasoning_effort field. Unfortunately, adding model_reasoning_effort = "high" to Codex's config doesn't seem to do anything; it's stuck at medium reasoning effort. When chatting with the same endpoint through openwebui, the model correctly uses high reasoning effort.

tarruda commented Nov 11, 2025

Interesting. I'll look into it as backslashes are a bit of a problem with llama.cpp and the way it streams JSON. Might be hitting a bug there.

Note that this solution bypasses llama.cpp's constrained output completely. It uses the /completion API, which only understands tokens.

In fact, this is only usable with codex because GPT-OSS 120b is good enough to produce correct JSON most of the time.

tarruda commented Nov 11, 2025

I am guessing that's because Codex is passing a reasoning_effort field. Unfortunately, adding model_reasoning_effort = "high" to Codex's config doesn't seem to do anything; it's stuck at medium reasoning effort. When chatting with the same endpoint through openwebui, the model correctly uses high reasoning effort.

It seems codex does not send reasoning config when connecting to a different responses endpoint. You can modify api_server.py to set a default body.reasoning when it is not set:

diff --git a/gpt_oss/responses_api/api_server.py b/gpt_oss/responses_api/api_server.py
index 009fa8d..9334ed9 100644
--- a/gpt_oss/responses_api/api_server.py
+++ b/gpt_oss/responses_api/api_server.py
@@ -1134,6 +1134,8 @@ def create_api_server(
 
     @app.post("/v1/responses", response_model=ResponseObject)
     async def generate(body: ResponsesRequest, request: Request):
+        if body.reasoning is None:
+            body.reasoning = ReasoningConfig(effort="high")
         print("request received")
         print(body.reasoning)

Mushoz commented Nov 11, 2025

I'm currently working on an alternative implementation that aims to fix this by using pre-call stop tokens to get control back from llama.cpp before it generates a tool call, so that it can pass a proper grammar and enforce the correct output format.

If I am understanding this correctly, it would then enforce lark grammar for the apply_patch tool call, right?

tarruda commented Nov 11, 2025

If I am understanding this correctly, it would then enforce lark grammar for the apply_patch tool call, right?

That's the ideal situation, but that will depend on whether I can properly implement a Lark -> GBNF conversion function (the responses API exposes Lark as a specifier language, but llama.cpp doesn't know about it).

If I'm unable to do this, then I will add special handling for apply_patch that enforces it using the BNF grammar here.

aldehir commented Nov 11, 2025

Note that this solution bypasses llama.cpp's constrained output completely. It uses the /completion API, which only understands tokens.

Thanks for pointing that out, saves me some time. Sounds like you have a fun challenge on your hands.

Mushoz commented Nov 11, 2025

Is there a reason why you are using the completions endpoint? Wouldn't it be possible to use the chat completions endpoint with --reasoning-format none so as to let this library do the harmony parsing, while benefiting from llamacpp's tool-calling constraints?

aldehir commented Nov 11, 2025

Is there a reason why you are using the completions endpoint? Wouldn't it be possible to use the chat completions endpoint with --reasoning-format none so as to let this library do the harmony parsing, while benefiting from llamacpp's tool-calling constraints?

Even that will parse out the harmony tokens after the analysis channel. It is an existing pattern in llama.cpp that was weird to adopt for gpt-oss, since no other models had content in a structured format, just the reasoning.

Mushoz commented Nov 11, 2025

Is there a reason why you are using the completions endpoint? Wouldn't it be possible to use the chat completions endpoint with --reasoning-format none so as to let this library do the harmony parsing, while benefiting from llamacpp's tool-calling constraints?

Even that will parse out the harmony tokens after the analysis channel. It is an existing pattern in llama.cpp that was weird to adopt for gpt-oss, since no other models had content in a structured format, just the reasoning.

That's unfortunate to hear. It feels wasteful to use the completions endpoint, which disables tool-calling constraints, only to then implement your own version of constraints. It would be much more elegant if there were some way to disable the parsing in llamacpp under the regular chat completions endpoint.

There's also this new --special switch for llama-server, whose help text says "special tokens output enabled". It's probably something completely different, but I am just thinking out loud here.

tarruda commented Nov 12, 2025

@Mushoz here's a pretty hacky proof of concept in case you want to try it out: https://github.com/tarruda/gpt-oss/tree/codex_api_server.

You can run it with

LLAMA_SERVER_URL=http://127.0.0.1:8080 python -m gpt_oss.responses_api.api_server_codex

This implementation enforces structured output for function call parameters and has special handling for apply_patch that uses a modified grammar designed to match the patch "language" inside a JSON object. This is necessary because gpt-oss probably doesn't work with free-form tool calls.

@aldehir I'd appreciate it if you could take a quick look at the apply_patch grammar and point out any obvious issues (it was created with GPT from the original Lark grammar): https://github.com/tarruda/gpt-oss/blob/codex_api_server/gpt_oss/responses_api/api_server_codex.py#L79-L112

Mushoz commented Nov 12, 2025

Maybe this is a bit out-of-scope, but since it's aimed at llamacpp maybe not: It would be super helpful if this library could send a new "dummy" request (perhaps behind an optional switch?) after a final message has been written, to have llamacpp generate 1 token. Because the COT is removed once a final message has been sent, it would be super helpful to warm the cache so follow-up questions/queries are processed much quicker.

tarruda commented Nov 12, 2025

Maybe this is a bit out-of-scope, but since it's aimed at llamacpp maybe not: It would be super helpful if this library could send a new "dummy" request (perhaps behind an optional switch?) after a final message has been written, to have llamacpp generate 1 token. Because the COT is removed once a final message has been sent, it would be super helpful to warm the cache so follow-up questions/queries are processed much quicker.

Are you talking about the codex branch? Not sure if related, but I see the following logs from llama.cpp server:

srv  get_availabl: prompt cache update took 87.94 ms                                                     
slot launch_slot_: id  3 | task 115901 | processing task                                                 
slot update_slots: id  3 | task 115901 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 14667                                         
slot update_slots: id  3 | task 115901 | n_past = 7782, slot.prompt.tokens.size() = 16079, seq_id = 3, pos_min = 15439, n_swa = 128                   
slot update_slots: id  3 | task 115901 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

And the prompt is fully re-processed. AFAICT it happens when sending a follow-up message in codex after a chain of tool calls. If I send chat messages I can see it reusing the cache.

Mushoz commented Nov 12, 2025

I haven't tried your codex branch yet. But what you're seeing is simply due to SWA. Basically, SWA only allows you to go back N tokens in your cache, where N is equal to the size of the sliding window. Llamacpp works around this by checkpointing multiple KV caches, but if the prompt prefix goes back too many steps, then you won't get a cache hit for any of your checkpoints either and you will see a full recompute.

If you can spare the memory, the --swa-full switch disables the memory savings of SWA and simply stores everything as a regular KV cache. That allows you to go back as many tokens as you want.
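For example, something along these lines (model path and context size are placeholders):

llama-server -m gpt-oss-120b.gguf --ctx-size 131072 --swa-full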

As for my idea, let's say your session history looks something like this:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3>

Now it's your turn (user turn 4), and you send a query that triggers multiple tool calls:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><COT 1><tool call 1><COT 2><tool call 2><COT 3><tool call 3><AI turn 4 final answer>

Now when you send a followup query, the COT is removed from the AI's previous turn, and the history will look like this:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><tool call 1><tool call 2><tool call 3><AI turn 4 final answer><user turn 5>

If you get a cache hit (either by using swa-full or having enough checkpoints), then this will be in the cache:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4>

<tool call 1> will NOT be in the cache, because the cache was generated with <COT 1> before that first tool call, so by removing the COT we've effectively invalidated the cache, which forces another round of prompt processing for:

<tool call 1><tool call 2><tool call 3><AI turn 4 final answer><user turn 5>

What I am proposing is for the library to send a dummy query as soon as the final answer is given; that dummy query has the COT removed and simply generates a single token:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><tool call 1><tool call 2><tool call 3><AI turn 4 final answer>

This allows the model to build the KV cache for the parts which were invalidated:

<tool call 1><tool call 2><tool call 3><AI turn 4 final answer>

Then, once you actually send a query, rather than having to reprocess all those tool calls + the final answer, you only have to process:

<user turn 5>

As the rest is already cached.

Obviously this does not benefit you whatsoever if you reply immediately after the AI's final answer. But if you don't, which is usually the case as you're reading the answer, then this gives the backend time to cache, making follow-up questions that much quicker to process.
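A minimal sketch of that idea, assuming llama.cpp server's /completion endpoint with its cache_prompt and n_predict parameters, and a hypothetical render_history() helper that produces the COT-stripped prompt:

import requests

LLAMA_SERVER_URL = "http://127.0.0.1:8080"  # assumption: local llama-server

def warm_cache(prompt: str) -> None:
    # Hypothetical cache-warming request: re-submit the COT-stripped history
    # and ask for a single throwaway token so llama.cpp rebuilds its KV cache
    # now, instead of when the user's next message arrives.
    requests.post(
        f"{LLAMA_SERVER_URL}/completion",
        json={
            "prompt": prompt,       # history with the reasoning removed
            "n_predict": 1,         # generate (and discard) one token
            "cache_prompt": True,   # keep the processed prompt in the KV cache
        },
        timeout=600,
    )

# e.g. right after streaming the final answer back to the client:
# warm_cache(render_history(messages_without_cot))  # render_history is hypothetical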
