In real-time conversation agents, detecting intent quickly and accurately is key to a great user experience. But today the first layer of intent detection is a bottleneck for multi-threaded AI task execution.
When we used GPT-5 with reasoning to detect the intent of user input (text or voice), it was correct almost 100% of the time, but sometimes took up to 12 seconds. That kills the whole idea of multi-threaded task execution and on-the-fly state updates.
That's why we decided to fine-tune a small model to speed up the "event loop" and unlock a truly real-time experience.
The gap: Recent Granola/Poke/Cluely-style assistants demonstrate compelling UX, but latency and intent drift still block enterprise adoption. Teams are forced to choose between big-model accuracy and small-model responsiveness.
- Speed-first routing: Default to a calibrated small model; invoke the large model only when confidence drops (see the sketch after this list).
- Teacher-level accuracy: Maintain premium-model intent quality via distillation and tight validation.
- LoRA-friendly dataset: Plug-and-play JSONL rows for fine-tuning any adapter-based stack.
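A minimal sketch of the speed-first routing described in the first bullet, assuming both models sit behind a common `classify` interface and that the student exposes a confidence score; the type names, fields, and threshold value are illustrative, not part of the released code.

```ts
// Hypothetical types; the real contract lives in intent-prompt.ts.
type Action = "reply" | "start_task" | "update_task" | "cancel_task" | "noop";

interface IntentDecision {
  action: Action;
  taskId?: string;      // required for start/update/cancel in the real schema
  confidence: number;   // 0..1, assumed to come from the student's logprobs
}

interface IntentModel {
  classify(messages: string[], tasks: string[]): Promise<IntentDecision>;
}

// Route to the fast student by default; escalate to the teacher only when
// the student's confidence falls below a calibrated threshold.
async function routeIntent(
  student: IntentModel,
  teacher: IntentModel,
  messages: string[],
  tasks: string[],
  threshold = 0.85, // illustrative value; calibrate against held-out data
): Promise<IntentDecision> {
  const fast = await student.classify(messages, tasks);
  if (fast.confidence >= threshold) return fast;
  return teacher.classify(messages, tasks); // slow path, rare by design
}
```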
You can download the fine-tuned model from https://huggingface.co/dvdk98/sraq-gpt-oss-20b.
The goal: Provide intent decisions in <200 ms without sacrificing accuracy or state alignment.
- Latency bottleneck: Large models in voice, WhatsApp, or Discord agents hog the main conversation thread, introducing lag.
- Quality vs. speed: We need intent detection that preserves the accuracy of a bigger model while delivering responses fast enough to keep multi-threaded conversations fluid.
- State fidelity: Intent predictions must reference the live task ledger (start, update, cancel, noop) so follow-on agents stay in sync.
- 10× faster intent handoffs versus a monolithic LLM loop.
- No regressions in user-visible responses across priority intents.
- Trustworthy task ledger updates that survive human audit.
- Hybrid stack: Pair a large teacher model with a distilled student (`gpt-oss-20b`) optimized for low-latency inference.
- High-quality synthetic data: Generate nuanced, high-reasoning multi-turn conversations to reflect the target use cases.
- Tight validation loop: Manually inspect samples to ensure alignment with production expectations and mitigate hallucinations.
- Explicit intent contract: Leverage the shared `intent-prompt.ts` system prompt so every sample follows the same action schema (`reply`, `start_task`, `update_task`, `cancel_task`, `noop`) and references the live task ledger the way production traffic does (a schema sketch follows this list).
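We don't reproduce `intent-prompt.ts` here; the following is only a hypothetical Zod-style reconstruction of the five-action contract it enforces, with illustrative field names beyond the action strings themselves.

```ts
import { z } from "zod";

// Illustrative reconstruction of the action contract; the authoritative
// schema is the one exported from intent-prompt.ts.
export const IntentActionSchema = z.discriminatedUnion("action", [
  z.object({ action: z.literal("reply"), message: z.string() }),
  z.object({ action: z.literal("start_task"), taskId: z.string(), title: z.string() }),
  z.object({ action: z.literal("update_task"), taskId: z.string(), update: z.string() }),
  z.object({ action: z.literal("cancel_task"), taskId: z.string() }),
  z.object({ action: z.literal("noop") }),
]);

export type IntentAction = z.infer<typeof IntentActionSchema>;
```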
- Size: 1,000 synthetic multi-turn conversations crafted with GPT-5 (high reasoning mode) tailored to intent-routing scenarios.
- Design principles:
  - Coverage of overlapping intents, clarifications, and pivot points common in real-time support flows.
  - Variation in tone, modality (voice/chat), and handoff cues to stress-test the model.
- Availability: Included in the repository for reproducibility and further experimentation.
- Intent prompt alignment: Each row is produced by the Intent Orchestrator prompt in `intent-prompt.ts`, which enforces the contract between the message transcript, the task ledger, and a single chosen action. The same prompt is used at inference time, so training examples mirror the assistant's runtime decision surface.
- Schema: Every record contains `messages`, `tasks`, and a `final` action string validated against the Zod schema exported from `intent-prompt.ts`, ensuring downstream consumers can parse and execute decisions without defensive checks (an example row is sketched after this list).
- Action coverage: The generator balances samples across all five actions and validates that `start_task`, `update_task`, and `cancel_task` reference real task ids, replicating edge cases the orchestrator faces in production.
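For illustration, here is a hypothetical training row in the shape described above; the conversation, task ids, and the exact payload of `final` are invented, and the authoritative shape is whatever the schema in `intent-prompt.ts` validates.

```ts
// Hypothetical training row, shown as a TypeScript literal for readability;
// in the dataset each row is serialized as a single JSONL line. Contents are
// invented; the authoritative shape is the schema in intent-prompt.ts.
const exampleRow = {
  messages: [
    { role: "user", content: "Can you book the flight we discussed?" },
    { role: "assistant", content: "Sure, starting on that now." },
    { role: "user", content: "Actually, make it Thursday instead of Friday." },
  ],
  tasks: [
    { id: "task_flight_booking", status: "in_progress", summary: "Book the flight" },
  ],
  // `final` holds the chosen action as a string; here it serializes an action
  // object, but the exact payload is defined by intent-prompt.ts.
  final: JSON.stringify({
    action: "update_task",
    taskId: "task_flight_booking",
    update: "Move departure from Friday to Thursday",
  }),
};
```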
- Teacher model: GPT-5 (high reasoning) generates authoritative intent labels and responses for every conversation turn.
- Student model: Fine-tune and distill into `gpt-oss-20b`, targeting a balance between speed and intent fidelity.
- LoRA ready: The dataset is structured for LoRA adapters on any base model; we picked `gpt-oss-20b` because it balances fast inference with strong reasoning.
- Pipeline:
  - Generate intent annotations and exemplar responses via GPT-5 (high reasoning).
  - Validate a stratified sample of 100 conversations; observed 99% correctness after manual review.
  - Distill the teacher’s signals into `gpt-oss-20b` with latency-focused optimization (a runtime invocation sketch follows this list).
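As a sketch of how the distilled student can be called at runtime, assuming it is served behind an OpenAI-compatible chat endpoint (for example via vLLM); the URL, served model name, and prompt wiring below are placeholders rather than part of this repository.

```ts
// Minimal latency-conscious call to the distilled student over an
// OpenAI-compatible /chat/completions endpoint (assumed deployment).
async function classifyWithStudent(
  systemPrompt: string,                         // the intent-prompt.ts system prompt
  transcript: { role: string; content: string }[],
  signal?: AbortSignal,                         // lets the event loop abort slow calls
): Promise<unknown> {
  const res = await fetch("http://localhost:8000/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "dvdk98/sraq-gpt-oss-20b",
      messages: [{ role: "system", content: systemPrompt }, ...transcript],
      temperature: 0,                           // deterministic routing decisions
      max_tokens: 128,                          // intent decisions are short by design
    }),
    signal,
  });
  const data = await res.json();
  // Parse and validate with the schema from intent-prompt.ts before acting on it.
  return JSON.parse(data.choices[0].message.content);
}
```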
- Manual audit: 100-row sample validation, confirming 99% intent-label accuracy.
- Benchmark suite: Measures per-intent precision/recall, latency, and throughput (see the sketch after this list).
- Comparison: Track performance deltas against the GPT-5 teacher to confirm bounded quality loss.
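The per-intent precision/recall tally can be as simple as the following sketch, which assumes gold and predicted labels have already been reduced to the five action strings; it is a generic illustration, not the repository's benchmark code.

```ts
type Action = "reply" | "start_task" | "update_task" | "cancel_task" | "noop";

// Tally per-action precision and recall from paired gold/predicted labels.
function perIntentMetrics(gold: Action[], pred: Action[]) {
  const actions: Action[] = ["reply", "start_task", "update_task", "cancel_task", "noop"];
  const metrics: Record<string, { precision: number; recall: number }> = {};
  for (const a of actions) {
    let tp = 0, fp = 0, fn = 0;
    gold.forEach((g, i) => {
      if (pred[i] === a && g === a) tp++;
      else if (pred[i] === a && g !== a) fp++;
      else if (pred[i] !== a && g === a) fn++;
    });
    metrics[a] = {
      precision: tp + fp > 0 ? tp / (tp + fp) : 0,
      recall: tp + fn > 0 ? tp / (tp + fn) : 0,
    };
  }
  return metrics;
}
```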
- Integrate live latency profiling across target platforms (voice, WhatsApp, Discord).
- Expand manual validation coverage and introduce automated regression tests.
- Explore quantization or model slicing to further reduce inference cost without losing accuracy.