Real-Time Intent Detection for Multithreaded Conversations

Context

In real-time conversation agents, catching intent quickly and accurately is key to a great user experience. But today, the first layer of intent detection is a bottleneck for multi-threaded AI task execution.

When we tried to detect the intent of user input (text or voice), GPT-5 with reasoning was correct almost 100% of the time, but it sometimes took up to 12 seconds. That kills the whole idea of multi-threaded task execution and on-the-fly state updates.

That's why we decided to fine-tune a small model to speed up the "event loop" and unlock a truly real-time experience.

The gap: Recent Granola/Poke/Cluely-style assistants prove out a compelling UX, but latency and intent drift still block enterprise adoption. Teams are forced to choose between big-model accuracy and small-model responsiveness.

TL;DR

  • Speed-first routing: Default to a calibrated small model; invoke the large model only when confidence drops (see the sketch after this list).
  • Teacher-level accuracy: Maintain premium-model intent quality via distillation and tight validation.
  • LoRA-friendly dataset: Plug-and-play JSONL rows for fine-tuning any adapter-based stack.
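
As a rough illustration of the speed-first routing above: the sketch below gates on the student's calibrated confidence and falls back to the teacher only when it drops. The client names, `classify` signature, and the 0.85 threshold are illustrative assumptions, not the repo's actual API.

```ts
// Hypothetical speed-first router: try the distilled student first and only
// escalate to the large teacher model when confidence is low.
// All names and the threshold below are illustrative assumptions.

interface IntentDecision {
  action: "reply" | "start_task" | "update_task" | "cancel_task" | "noop";
  confidence: number; // calibrated probability in [0, 1]
}

interface IntentModel {
  classify(transcript: string, taskLedger: string[]): Promise<IntentDecision>;
}

const CONFIDENCE_FLOOR = 0.85; // assumed escalation threshold

async function routeIntent(
  transcript: string,
  taskLedger: string[],
  student: IntentModel, // fast distilled gpt-oss-20b
  teacher: IntentModel, // slow, high-accuracy GPT-5
): Promise<IntentDecision> {
  const fast = await student.classify(transcript, taskLedger);
  if (fast.confidence >= CONFIDENCE_FLOOR) {
    return fast; // fast path keeps the conversation thread responsive
  }
  // Ambiguous turn: pay the teacher's latency cost only here.
  return teacher.classify(transcript, taskLedger);
}
```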

You can download the fine-tuned model from https://huggingface.co/dvdk98/sraq-gpt-oss-20b.

Benchmark snapshot

[Benchmark screenshot, 2025-09-26]

Technical Problem

Provide intent decisions in <200 ms without sacrificing accuracy or state alignment.

  • Latency bottleneck: Large models in voice, WhatsApp, or Discord agents hog the main conversation thread, introducing lag.
  • Quality vs. speed: We need intent detection that preserves the accuracy of a bigger model while delivering responses fast enough to keep multi-threaded conversations fluid.
  • State fidelity: Intent predictions must reference the live task ledger (start, update, cancel, noop) so follow-on agents stay in sync.

What success looks like

  • 10× faster intent handoffs versus a monolithic LLM loop.
  • No regressions in user-visible responses across priority intents.
  • Trustworthy task ledger updates that survive human audit.

Solution Overview

  • Hybrid stack: Pair a large teacher model with a distilled student (gpt-oss-20b) optimized for low-latency inference.
  • High-quality synthetic data: Generate nuanced, high-reasoning multi-turn conversations to reflect the target use cases.
  • Tight validation loop: Manually inspect samples to ensure alignment with production expectations and mitigate hallucinations.
  • Explicit intent contract: Leverage the shared intent-prompt.ts system prompt so every sample follows the same action schema (reply, start_task, update_task, cancel_task, noop) and references the live task ledger the way production traffic does.
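
The action contract lends itself to a Zod discriminated union. The sketch below is one plausible shape; only the five action strings and the use of a Zod schema in intent-prompt.ts come from this repo, while field names such as `taskId` and `message` are assumptions.

```ts
import { z } from "zod";

// One plausible encoding of the intent contract described above.
// Field names other than the five action strings are assumptions.
export const IntentActionSchema = z.discriminatedUnion("action", [
  z.object({ action: z.literal("reply"), message: z.string() }),
  z.object({
    action: z.literal("start_task"),
    taskId: z.string(),
    description: z.string(),
  }),
  z.object({
    action: z.literal("update_task"),
    taskId: z.string(),
    update: z.string(),
  }),
  z.object({ action: z.literal("cancel_task"), taskId: z.string() }),
  z.object({ action: z.literal("noop") }),
]);

export type IntentAction = z.infer<typeof IntentActionSchema>;
```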

Dataset

  • Size: 1,000 synthetic multi-turn conversations crafted with GPT-5 (high reasoning mode) tailored to intent-routing scenarios.
  • Design principles:
    • Coverage of overlapping intents, clarifications, and pivot points common in real-time support flows.
    • Variation in tone, modality (voice/chat), and handoff cues to stress-test the model.
  • Availability: Included in the repository for reproducibility and further experimentation.
  • Intent prompt alignment: Each row is produced by the Intent Orchestrator prompt in intent-prompt.ts, which enforces the contract between the message transcript, the task ledger, and a single chosen action. The same prompt is used in inference, so training examples mirror the assistant’s runtime decision surface.
  • Schema: Every record contains messages, tasks, and a final action string validated against the Zod schema exported from intent-prompt.ts, ensuring downstream consumers can parse and execute decisions without defensive checks.
  • Action coverage: The generator balances samples across all five actions and validates that start_task, update_task, and cancel_task reference real task ids, replicating edge cases the orchestrator faces in production.
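
For concreteness, a row could look roughly like the sketch below, alongside the task-id guard described in the action-coverage bullet. The field names, status values, and the import name for the shared schema are assumptions; only the messages/tasks/action layout comes from the description above.

```ts
import { z } from "zod";
// Assumed export name; the README only says the schema lives in intent-prompt.ts.
import { IntentActionSchema } from "./intent-prompt";

// Illustrative row shape: a transcript, the task ledger, and one action.
const DatasetRowSchema = z.object({
  messages: z.array(
    z.object({ role: z.enum(["user", "assistant"]), content: z.string() }),
  ),
  tasks: z.array(
    z.object({ id: z.string(), status: z.enum(["active", "done", "cancelled"]) }),
  ),
  action: IntentActionSchema,
});

type DatasetRow = z.infer<typeof DatasetRowSchema>;

// Generator-side guard mirroring the action-coverage rule: task-referencing
// actions must point at an id that actually exists in the row's ledger.
function referencesRealTask(row: DatasetRow): boolean {
  const action = row.action;
  if (action.action === "reply" || action.action === "noop") return true;
  return row.tasks.some((task) => task.id === action.taskId);
}
```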

Training & Distillation

  • Teacher model: GPT-5 (high reasoning) generates authoritative intent labels and responses for every conversation turn.
  • Student model: Fine-tune and distill into gpt-oss-20b, targeting a balance between speed and intent fidelity.
  • LoRA ready: The dataset is structured for LoRA adapters on any base model; we picked gpt-oss-20b because it balances fast inference with strong reasoning.
  • Pipeline:
    1. Generate intent annotations and exemplar responses via GPT-5 (high reasoning).
    2. Validate a stratified sample of 100 conversations; observed 99% correctness after manual review.
    3. Distill the teacher’s signals into gpt-oss-20b with latency-focused optimization.
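
To make the "LoRA ready" claim concrete, a minimal sketch of turning dataset rows into chat-style SFT examples might look like this. The file names, system-prompt placeholder, and output format are assumptions; only the transcript + ledger → action framing comes from the dataset description above.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Assumed row shape; see the dataset section above.
interface Row {
  messages: { role: "user" | "assistant"; content: string }[];
  tasks: { id: string; status: string }[];
  action: unknown;
}

// Turn one row into a chat-format SFT example: the student sees the same
// system prompt and ledger the orchestrator sees at runtime, and is trained
// to emit the teacher's chosen action as its reply.
function toSftExample(row: Row) {
  return {
    messages: [
      { role: "system", content: "<intent-prompt.ts system prompt goes here>" },
      ...row.messages,
      { role: "user", content: `Task ledger: ${JSON.stringify(row.tasks)}` },
      { role: "assistant", content: JSON.stringify(row.action) },
    ],
  };
}

// File names are placeholders.
const rows: Row[] = readFileSync("dataset.jsonl", "utf8")
  .trim()
  .split("\n")
  .map((line) => JSON.parse(line) as Row);

writeFileSync(
  "sft.jsonl",
  rows.map((row) => JSON.stringify(toSftExample(row))).join("\n"),
);
```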

Evaluation

  • Manual audit: 100-row sample validation, confirming 99% intent-label accuracy.
  • Benchmark suite: Measures per-intent precision/recall, latency, and throughput.
  • Comparison: Track performance deltas against the GPT-5 teacher to confirm bounded quality loss.
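
As a sketch of the per-intent precision/recall calculation in the benchmark suite, one straightforward implementation looks like this; the names are illustrative, and the (gold, predicted) pairs would come from running the student against the teacher's labels.

```ts
// Illustrative per-intent precision/recall over (gold, predicted) action pairs.
interface LabeledTurn {
  gold: string; // teacher (GPT-5) action label
  predicted: string; // student (gpt-oss-20b) action label
}

function perIntentPrecisionRecall(
  turns: LabeledTurn[],
): Record<string, { precision: number; recall: number }> {
  const intents = new Set(turns.flatMap((t) => [t.gold, t.predicted]));
  const report: Record<string, { precision: number; recall: number }> = {};
  for (const intent of intents) {
    const tp = turns.filter((t) => t.predicted === intent && t.gold === intent).length;
    const fp = turns.filter((t) => t.predicted === intent && t.gold !== intent).length;
    const fn = turns.filter((t) => t.predicted !== intent && t.gold === intent).length;
    report[intent] = {
      precision: tp + fp === 0 ? 0 : tp / (tp + fp),
      recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    };
  }
  return report;
}
```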

Raw Benchmark Results

[Raw benchmark screenshots, 2025-09-26]

Next Steps

  • Integrate live latency profiling across target platforms (voice, WhatsApp, Discord).
  • Expand manual validation coverage and introduce automated regression tests.
  • Explore quantization or model slicing to further reduce inference cost without losing accuracy.
