In real-time conversation agents, detecting intent quickly and accurately is key to a great user experience. But today the first layer of intent detection is a bottleneck for multi-threaded AI task execution.
When we used GPT-5 with reasoning to detect the intent of user input (text or voice), it was correct almost 100% of the time, but sometimes took up to 12 seconds. That kills the whole idea of multi-threaded task execution and on-the-fly state updates.
That's why we decided to fine-tune a small model to speed up the "event loop" and unlock a truly real-time experience.
The gap: Recent Granola/Poke/Cluely-style assistants demonstrate compelling UX, but latency and intent drift still block enterprise adoption. Teams are forced to choose between big-model accuracy and small-model responsiveness.
- Speed-first routing: Default to a calibrated small model; invoke the large model only when confidence drops (see the sketch after this list).
- Teacher-level accuracy: Maintain premium-model intent quality via distillation and tight validation.
- LoRA-friendly dataset: Plug-and-play JSONL rows for fine-tuning any adapter-based stack.
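A minimal sketch of the speed-first routing described in the first bullet, assuming both models sit behind a common `classify` interface and that the student exposes a confidence score; the type names, fields, and threshold value are illustrative, not part of the released code.

```ts
// Hypothetical types; the real contract lives in intent-prompt.ts.
type Action = "reply" | "start_task" | "update_task" | "cancel_task" | "noop";

interface IntentDecision {
  action: Action;
  taskId?: string;      // required for start/update/cancel in the real schema
  confidence: number;   // 0..1, assumed to come from the student's logprobs
}

interface IntentModel {
  classify(messages: string[], tasks: string[]): Promise<IntentDecision>;
}

// Route to the fast student by default; escalate to the teacher only when
// the student's confidence falls below a calibrated threshold.
async function routeIntent(
  student: IntentModel,
  teacher: IntentModel,
  messages: string[],
  tasks: string[],
  threshold = 0.85, // illustrative value; calibrate against held-out data
): Promise<IntentDecision> {
  const fast = await student.classify(messages, tasks);
  if (fast.confidence >= threshold) return fast;
  return teacher.classify(messages, tasks); // slow path, rare by design
}
```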
You can download the fine-tuned model from https://huggingface.co/dvdk98/sraq-gpt-oss-20b.
The goal: Provide intent decisions in <200 ms without sacrificing accuracy or state alignment.
- Latency bottleneck: Large models in voice, WhatsApp, or Discord agents hog the main conversation thread, introducing lag.
- Quality vs. speed: We need intent detection that preserves the accuracy of a bigger model while delivering responses fast enough to keep multi-threaded conversations fluid.
- State fidelity: Intent predictions must reference the live task ledger (start, update, cancel, noop) so follow-on agents stay in sync.
- 10× faster intent handoffs versus a monolithic LLM loop.
- No regressions in user-visible responses across priority intents.
- Trustworthy task ledger updates that survive human audit.
- Hybrid stack: Pair a large teacher model with a distilled student (`gpt-oss-20b`) optimized for low-latency inference.
- High-quality synthetic data: Generate nuanced, high-reasoning multi-turn conversations to reflect the target use cases.
- Tight validation loop: Manually inspect samples to ensure alignment with production expectations and mitigate hallucinations.
- Explicit intent contract: Leverage the shared `intent-prompt.ts` system prompt so every sample follows the same action schema (`reply`, `start_task`, `update_task`, `cancel_task`, `noop`) and references the live task ledger the way production traffic does (a schema sketch follows this list).
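We don't reproduce `intent-prompt.ts` here; the following is only a hypothetical Zod-style reconstruction of the five-action contract it enforces, with illustrative field names beyond the action strings themselves.

```ts
import { z } from "zod";

// Illustrative reconstruction of the action contract; the authoritative
// schema is the one exported from intent-prompt.ts.
export const IntentActionSchema = z.discriminatedUnion("action", [
  z.object({ action: z.literal("reply"), message: z.string() }),
  z.object({ action: z.literal("start_task"), taskId: z.string(), title: z.string() }),
  z.object({ action: z.literal("update_task"), taskId: z.string(), update: z.string() }),
  z.object({ action: z.literal("cancel_task"), taskId: z.string() }),
  z.object({ action: z.literal("noop") }),
]);

export type IntentAction = z.infer<typeof IntentActionSchema>;
```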
- Size: 1,000 synthetic multi-turn conversations crafted with GPT-5 (high reasoning mode) tailored to intent-routing scenarios.
- Design principles:
  - Coverage of overlapping intents, clarifications, and pivot points common in real-time support flows.
  - Variation in tone, modality (voice/chat), and handoff cues to stress-test the model.
- Availability: Included in the repository for reproducibility and further experimentation.
- Intent prompt alignment: Each row is produced by the Intent Orchestrator prompt in `intent-prompt.ts`, which enforces the contract between the message transcript, the task ledger, and a single chosen action. The same prompt is used at inference time, so training examples mirror the assistant's runtime decision surface.
- Schema: Every record contains `messages`, `tasks`, and a `final` action string validated against the Zod schema exported from `intent-prompt.ts`, ensuring downstream consumers can parse and execute decisions without defensive checks (an example row is sketched after this list).
- Action coverage: The generator balances samples across all five actions and validates that `start_task`, `update_task`, and `cancel_task` reference real task ids, replicating edge cases the orchestrator faces in production.
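For illustration, here is a hypothetical training row in the shape described above; the conversation, task ids, and the exact payload of `final` are invented, and the authoritative shape is whatever the schema in `intent-prompt.ts` validates.

```ts
// Hypothetical training row, shown as a TypeScript literal for readability;
// in the dataset each row is serialized as a single JSONL line. Contents are
// invented; the authoritative shape is the schema in intent-prompt.ts.
const exampleRow = {
  messages: [
    { role: "user", content: "Can you book the flight we discussed?" },
    { role: "assistant", content: "Sure, starting on that now." },
    { role: "user", content: "Actually, make it Thursday instead of Friday." },
  ],
  tasks: [
    { id: "task_flight_booking", status: "in_progress", summary: "Book the flight" },
  ],
  // `final` holds the chosen action as a string; here it serializes an action
  // object, but the exact payload is defined by intent-prompt.ts.
  final: JSON.stringify({
    action: "update_task",
    taskId: "task_flight_booking",
    update: "Move departure from Friday to Thursday",
  }),
};
```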
- Teacher model: GPT-5 (high reasoning) generates authoritative intent labels and responses for every conversation turn.
- Student model: Fine-tune and distill into `gpt-oss-20b`, targeting a balance between speed and intent fidelity.
- LoRA ready: The dataset is structured for LoRA adapters on any base model; we picked `gpt-oss-20b` because it balances fast inference with strong reasoning.
- Pipeline:
  - Generate intent annotations and exemplar responses via GPT-5 (high reasoning).
  - Validate a stratified sample of 100 conversations; observed 99% correctness after manual review.
  - Distill the teacher’s signals into `gpt-oss-20b` with latency-focused optimization (a runtime invocation sketch follows this list).
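As a sketch of how the distilled student can be called at runtime, assuming it is served behind an OpenAI-compatible chat endpoint (for example via vLLM); the URL, served model name, and prompt wiring below are placeholders rather than part of this repository.

```ts
// Minimal latency-conscious call to the distilled student over an
// OpenAI-compatible /chat/completions endpoint (assumed deployment).
async function classifyWithStudent(
  systemPrompt: string,                         // the intent-prompt.ts system prompt
  transcript: { role: string; content: string }[],
  signal?: AbortSignal,                         // lets the event loop abort slow calls
): Promise<unknown> {
  const res = await fetch("http://localhost:8000/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "dvdk98/sraq-gpt-oss-20b",
      messages: [{ role: "system", content: systemPrompt }, ...transcript],
      temperature: 0,                           // deterministic routing decisions
      max_tokens: 128,                          // intent decisions are short by design
    }),
    signal,
  });
  const data = await res.json();
  // Parse and validate with the schema from intent-prompt.ts before acting on it.
  return JSON.parse(data.choices[0].message.content);
}
```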
- Manual audit: 100-row sample validation, confirming 99% intent-label accuracy.
- Benchmark suite: Measures per-intent precision/recall, latency, and throughput (see the sketch after this list).
- Comparison: Track performance deltas against the GPT-5 teacher to confirm bounded quality loss.
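The per-intent precision/recall tally can be as simple as the following sketch, which assumes gold and predicted labels have already been reduced to the five action strings; it is a generic illustration, not the repository's benchmark code.

```ts
type Action = "reply" | "start_task" | "update_task" | "cancel_task" | "noop";

// Tally per-action precision and recall from paired gold/predicted labels.
function perIntentMetrics(gold: Action[], pred: Action[]) {
  const actions: Action[] = ["reply", "start_task", "update_task", "cancel_task", "noop"];
  const metrics: Record<string, { precision: number; recall: number }> = {};
  for (const a of actions) {
    let tp = 0, fp = 0, fn = 0;
    gold.forEach((g, i) => {
      if (pred[i] === a && g === a) tp++;
      else if (pred[i] === a && g !== a) fp++;
      else if (pred[i] !== a && g === a) fn++;
    });
    metrics[a] = {
      precision: tp + fp > 0 ? tp / (tp + fp) : 0,
      recall: tp + fn > 0 ? tp / (tp + fn) : 0,
    };
  }
  return metrics;
}
```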
- Integrate live latency profiling across target platforms (voice, WhatsApp, Discord).
- Expand manual validation coverage and introduce automated regression tests.
- Explore quantization or model slicing to further reduce inference cost without losing accuracy.