Add llama.cpp server inference backend for responses_api #225
Conversation
This is adapted from the ollama backend, but uses llama.cpp server. Another difference is that it passes/receives raw tokens from llama.cpp.
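For readers unfamiliar with the raw-token flow, here is a minimal sketch of what a round trip against llama.cpp server's `/completion` endpoint can look like. This is not the PR's code: the parameter names come from the llama.cpp server docs, and the server address, token lists, and stop handling are placeholder assumptions.

```python
# Minimal sketch (not the PR's actual code) of a raw-token round trip with
# llama.cpp server's /completion endpoint. Parameter names ("prompt", "stream",
# "cache_prompt", "return_tokens", "n_predict") follow the llama.cpp server
# docs; the server URL and token handling are placeholder assumptions.
import json
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumed llama-server address

def stream_tokens(prompt_tokens, stop_tokens):
    """Send pre-rendered (e.g. harmony-encoded) token IDs and yield generated
    token IDs as llama.cpp streams them back."""
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": prompt_tokens,  # llama.cpp accepts a list of token IDs here
            "stream": True,
            "cache_prompt": True,     # reuse the KV cache across requests
            "return_tokens": True,    # include generated token IDs in each chunk
            "n_predict": -1,
        },
        stream=True,
    )
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank keep-alive lines between SSE events
        chunk = json.loads(line[len(b"data: "):])
        for tok in chunk.get("tokens", []):
            if tok in stop_tokens:
                return
            yield tok
        if chunk.get("stop"):
            return
```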
Did some testing, and this works fine together with codex configured with:
I've also fixed the token window calculation: openai/codex#6316. If you want to see correct "% context left" information with my codex PR, add:
Does this correctly pass back the reasoning content to llamacpp? Because gpt-oss-120b relies on reasoning content being included in subsequent requests after a tool call, so that it can do multiple tool calls in a row without losing the reasoning content.
This implementation uses llama.cpp server as a raw completion API (the /completion endpoint). Chat/reasoning abstractions are parsed outside of llama.cpp, so the answer is yes. Additionally, it is a responses API implementation by OpenAI and uses the reference harmony implementation, so it should be quite accurate.

One issue is that it doesn't benefit from llama.cpp's constrained output, which enforces that function calls are formatted according to the parameters' JSON schema, but in my testing with codex + gpt-oss it does well and is usable enough. The only issue I saw was with the

I'm currently working on an alternative implementation that aims to fix this by using pre-call stop tokens to get control back from llama.cpp before it generates a tool call, so that it can pass a proper grammar and enforce correct output format.
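As a rough illustration of that pre-call stop token idea (not the author's implementation), the flow could look like the sketch below. The stop marker, rendered prompt, and tool schema are placeholders; real harmony tool-call markers differ. The `stop` and `json_schema` request fields are llama.cpp server parameters.

```python
# Rough illustration of the "stop before the tool call, then constrain" idea.
# Not the author's implementation; marker, prompt, and schema are placeholders.
import requests

LLAMA_SERVER = "http://localhost:8080"      # assumed llama-server address
TOOL_CALL_MARKER = "<|tool-call-start|>"    # hypothetical marker preceding a call

def complete(prompt, **extra):
    body = {"prompt": prompt, "n_predict": 512, "cache_prompt": True, **extra}
    return requests.post(f"{LLAMA_SERVER}/completion", json=body).json()["content"]

prompt = "<harmony-rendered conversation goes here>"  # placeholder

# Hypothetical parameter schema for one of codex's tools.
shell_schema = {
    "type": "object",
    "properties": {"command": {"type": "array", "items": {"type": "string"}}},
    "required": ["command"],
}

# 1. Generate freely, but hand control back when a tool call is about to start.
head = complete(prompt, stop=[TOOL_CALL_MARKER])

# 2. Resume from that point with grammar-based sampling, so the call arguments
#    must conform to the tool's JSON schema.
args = complete(prompt + head + TOOL_CALL_MARKER, json_schema=shell_schema)
```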
When using codex with the standard chat completions endpoint of llamacpp, gpt-oss-120b really struggles with editing import statements in .php files. Those import statements include backslashes, and gpt-oss keeps escaping them, which then causes the search/replace block in apply_patch to fail. Is that the same grammar issue that you are talking about? Because if so, it's amazing to hear you identified the root cause and want to fix it!
I don't think it is the same issue. AFAICT the problem is that in the chat completions protocol there's no standard way of sending/receiving reasoning content, so each provider has its own solution; llama.cpp uses `reasoning_content`. If you want to use llama.cpp + gpt-oss + codex with the chat completions API, then you need to use a codex fork which adapts it to llama.cpp's reasoning key. Here are the changes you need to make in case you want to compile it locally. I tried to fix this in codex by making the reasoning key configurable, but the PR was rejected: openai/codex#6136.

Note that while llama.cpp chat completions with a fixed codex can work with GPT-OSS, it will still fail to generate patches sometimes.

In my alternative implementation I plan to make a responses API inference server (backed by llama.cpp).
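To make the key mismatch concrete, this is roughly where llama.cpp's chat completions endpoint puts the reasoning (the `reasoning_content` field name is llama.cpp's; the rest of the response is abbreviated):

```python
# Abbreviated llama.cpp chat-completions message: reasoning goes into a
# non-standard "reasoning_content" key, which stock codex does not send back.
llama_cpp_message = {
    "role": "assistant",
    "reasoning_content": "analysis of the user's request...",  # llama.cpp-specific key
    "content": "final answer",
}
# A client has to copy "reasoning_content" back into subsequent requests itself;
# codex only does this for the key it expects, hence the fork / configurable key.
```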
Adding @aldehir, who might be interested in this discussion.
It's unfortunate that it seems to be a different issue. I was hoping I'd finally found someone who understands what the issue is, because it's driving me crazy. It's very easy to reproduce. Make an empty repo with a User.php file:

Then give Codex the following prompt:

I can count 9 tool call failures with output in Codex's log like this:

It finally just cheats and deletes the entire file and rewrites it to work around the issue:

I was able to capture a tool call mid-generation in llamacpp, and it looks like this (it's incomplete since it's mid-generation):

The most interesting part is this:

Why is it constraining its output to valid JSON if apply_patch expects raw text? Is that what's causing the incorrect output? I have also seen it generate the action tool call inside the reasoning before executing it, and there it DID have the right syntax, so the model certainly is capable of it.

Interestingly, on a second run I saw another 4 tool call failures, but it eventually ended up succeeding:

So the model certainly is capable of it! I just have no idea why it's failing so often, or whether it's a bug in Codex, llamacpp, the grammar/chat template, or the model itself.
@Mushoz, if you look at the request codex sends, what does the schema for the tool ask for?
@aldehir: As captured by tcpdump:
Just to avoid confusion: This is with a regular chat completions endpoint of llamacpp. I am not running this responses endpoint yet. |
Oh ok, yeah, the reasoning content isn't passed back, like @tarruda mentioned. Codex returns it in:
I also recommend using mitmproxy in reverse proxy mode with:
I just set up this responses API and it seems to be working wonderfully for including COT in subsequent tool calls until a final message is emitted! So that's a massive improvement.

However, the extremely poor editing performance whenever backslashes are included remains, which effectively makes gpt-oss-120b useless when working on PHP projects. I know it's kinda off-topic for this PR as that part is working fine, but are either of you able to reproduce it with the toy example I gave earlier? E.g. an empty repo with a single User.php file with the following contents:

With the prompt:

Curious to see if it's a setup issue on my side, or if this is a widespread issue.
Interestingly, this prompt improves the success-rate massively:
Interesting. I'll look into it, as backslashes are a bit of a problem with llama.cpp and the way it streams JSON. Might be hitting a bug there.
Another thing I noticed while testing this PR is that the following in my llama-server command does not work anymore:
I am guessing that's because Codex is passing a reasoning_effort field. Unfortunately, adding a
Note that this solution bypasses llama.cpp constrained output completely. It uses the raw completion endpoint. In fact, this is only usable for codex because GPT-OSS 120b is good enough to produce correct JSON most of the time.
It seems codex does not send reasoning config when connecting to a different responses endpoint. You can modify api_server.py to set a default:

```diff
diff --git a/gpt_oss/responses_api/api_server.py b/gpt_oss/responses_api/api_server.py
index 009fa8d..9334ed9 100644
--- a/gpt_oss/responses_api/api_server.py
+++ b/gpt_oss/responses_api/api_server.py
@@ -1134,6 +1134,8 @@ def create_api_server(
     @app.post("/v1/responses", response_model=ResponseObject)
     async def generate(body: ResponsesRequest, request: Request):
+        if body.reasoning is None:
+            body.reasoning = ReasoningConfig(effort="high")
         print("request received")
         print(body.reasoning)
```
If I am understanding this correctly, it would then enforce the Lark grammar for the apply_patch tool call, right?
That's the ideal situation, but it will depend on whether I can properly implement a Lark -> GBNF conversion function (the responses API exposes Lark as a grammar specification language, but llama.cpp doesn't know about it). If I'm unable to do this, then I will add special handling for apply_patch that enforces it using the BNF grammar here.
Thanks for pointing that out, saves me some time. Sounds like you have a fun challenge on your hands.
Is there a reason why you are using the completions endpoint? Wouldn't it be possible to use the chat completions endpoint instead?
Even that will parse out the harmony tokens after the analysis channel. It is an existing pattern in llama.cpp that was weird to adopt for gpt-oss, since no other models had content in a structured format, just the reasoning.
That's unfortunate to hear. It feels wasteful to use the completions endpoint, which disables tool calling constraints, only to then implement your own version of constraints. It would be much more elegant if there were some way to disable the parsing in llamacpp under the regular chat completions endpoint. There's also a new switch that might be relevant here.
@Mushoz here's a pretty hacky proof of concept in case you want to try it out: https://github.com/tarruda/gpt-oss/tree/codex_api_server. You can run it with:

This implementation enforces structured output for function call parameters and has special handling for apply_patch that uses a modified grammar designed to match the patch "language" inside a JSON object. This is necessary because gpt-oss probably doesn't work with free-form tool calls.

@aldehir I'd appreciate it if you could take a quick look at the apply_patch grammar and point out any obvious issues (it was created with GPT by showing it the original Lark grammar): https://github.com/tarruda/gpt-oss/blob/codex_api_server/gpt_oss/responses_api/api_server_codex.py#L79-L112
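For context, the per-tool special casing could look roughly like the sketch below. This is not the code from the linked branch: `APPLY_PATCH_GBNF` stands in for the grammar defined there, and the `grammar`/`json_schema` fields are llama.cpp server request parameters.

```python
# Sketch only -- not the code from the codex_api_server branch.
APPLY_PATCH_GBNF = "..."  # placeholder for the grammar at api_server_codex.py#L79-L112

def constraint_for_tool(tool_name, tools):
    """Pick llama.cpp sampling constraints for the tool the model is about to call."""
    if tool_name == "apply_patch":
        # apply_patch gets a dedicated grammar that matches the patch "language"
        # inside a JSON object, instead of a generic JSON-schema constraint.
        return {"grammar": APPLY_PATCH_GBNF}
    # Ordinary function calls: let llama.cpp derive a grammar from the JSON schema.
    return {"json_schema": tools[tool_name]["parameters"]}
```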
Maybe this is a bit out of scope, but since it's aimed at llamacpp maybe not: it would be super helpful if this library could send a new "dummy" request (perhaps behind an optional switch?) after a message has been written to the final channel, to have llamacpp generate 1 token. Because the COT is removed after a message has been sent to the final channel, it would be super helpful to warm the cache so follow-up questions/queries are processed much quicker.
Are you talking about the codex branch? Not sure if related, but I see the following logs from llama.cpp server:

And the prompt is fully re-processed. AFAICT it happens when sending a follow-up message in codex after a chain of tool calls. If I send chat messages I can see it reusing the cache.
I haven't tried your codex branch yet. But what you're seeing is simply due to SWA. Basically, SWA only allows you to go back N tokens in your cache, where N is equal to the size of the sliding window. Llamacpp works around this by checkpointing multiple KV caches, but if the prompt prefix goes back too many steps, then you won't get a cache hit for any of your checkpoints either, and you will see a full recompute. If you can spare the memory, using the `--swa-full` flag avoids this.

As for my idea, let's say your session history looks something like this:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3>

Now it's your turn (user turn 4), and you send a query that triggers multiple tool calls:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><COT 1><tool call 1><COT 2><tool call 2><COT 3><tool call 3><AI turn 4 final answer>

Now when you send a followup query, the COT is removed from the AI's previous turn, and the history will look like this:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><tool call 1><tool call 2><tool call 3><AI turn 4 final answer><user turn 5>

If you get a cache hit (either by using swa-full or having enough checkpoints), then this will be in the cache:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4>

<tool call 1> will NOT be in the cache, because the cache was generated with <COT 1> before that first tool call, so by removing the COT we've effectively invalidated the cache, which forces another prompt processing of:

<tool call 1><tool call 2><tool call 3><AI turn 4 final answer><user turn 5>

What I am proposing is for the library to send a dummy query as soon as the final answer is given; that dummy query has the COT removed and simply generates a single token:

<user turn 1><AI turn 1><user turn 2><AI turn 2><user turn 3><AI turn 3><user turn 4><tool call 1><tool call 2><tool call 3><AI turn 4 final answer>

This allows the model to build the KV cache for the parts which were invalidated:

<tool call 1><tool call 2><tool call 3><AI turn 4 final answer>

Then, once you actually send a query, rather than having to reprocess all those tool calls + final answer, you only have to process:

<user turn 5>

As the rest is already cached. Obviously this does not benefit you whatsoever if you reply immediately after the AI's final answer. But if you don't, which is usually the case as you're reading the answer, then this gives the backend time to build the cache, making follow-up questions that much quicker to process.
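A sketch of what that warm-up could look like against llama.cpp server, assuming its `/completion` endpoint with `n_predict` and `cache_prompt`; the helper name and token list are illustrative (the tokens being whatever the next turn's prompt will be, i.e. the history with COT stripped):

```python
# Sketch of the proposed cache warm-up -- not part of this PR. After the final
# answer is emitted, re-send the history as it will appear next turn (COT
# removed) and generate one throwaway token so llama.cpp rebuilds the KV cache.
import threading
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumed llama-server address

def warm_cache(next_turn_prompt_tokens):
    def _go():
        requests.post(
            f"{LLAMA_SERVER}/completion",
            json={
                "prompt": next_turn_prompt_tokens,  # history with COT removed
                "n_predict": 1,                     # single dummy token, discarded
                "cache_prompt": True,               # keep the rebuilt KV cache
            },
            timeout=600,
        )
    # Fire and forget so the user-visible response isn't delayed.
    threading.Thread(target=_go, daemon=True).start()
```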