Conversation

@huaxig (Contributor) commented Oct 22, 2025

Summary

Implementation

  • User Session Management: A new LocalUserSession class manages the context of a conversation. It ensures that each request includes the context from previous turns, and only one request per user session is processed at a time, maintaining the conversational state (a minimal sketch of such a session object follows this list).

  • shared_prefix_datagen Enhancement: The shared_prefix_datagen can now group prompts into user sessions. When the new group_as_user_session flag is enabled in the configuration, each group of prompts simulates a multi-turn conversation (to support a range of QPS values, the prompts in each group are cycled as an infinite loop). The shared prefix acts as the initial system prompt, and subsequent prompts in the group are treated as conversational turns. In mp mode (multiple workers), each worker holds an ISOLATED group of user sessions to avoid the communication overhead of syncing conversations.
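
A minimal sketch of how such a per-user session object could look; the class name LocalUserSession and the group_as_user_session behavior come from this PR, but the field and method names (next_prompt, append_turn) are illustrative only:

```python
import asyncio
from itertools import cycle

class LocalUserSession:
    """Holds the conversational context for one simulated user (illustrative sketch)."""

    def __init__(self, session_id: str, shared_prefix: str, prompts: list[str]):
        self.session_id = session_id
        self.context = shared_prefix      # the shared prefix acts as the initial system prompt
        self._prompts = cycle(prompts)    # prompts in the group loop forever to sustain any QPS
        self._lock = asyncio.Lock()       # only one in-flight request per session
        self.chat_round = 0

    async def next_prompt(self) -> str:
        # Serialize requests for this session so each turn sees the previous context.
        await self._lock.acquire()
        self.chat_round += 1
        return self.context + " " + next(self._prompts)

    def append_turn(self, prompt: str, completion: str) -> None:
        # On success, fold the completed turn back into the session context.
        self.context = prompt + completion
        self._lock.release()

    def process_failure(self) -> None:
        # On failure the round is skipped; the context stays at the last successful round.
        self._lock.release()
```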

API and Configuration Changes:

  • The to_payload method in the API data classes is now asynchronous to support the new user session logic.
  • A process_failure method has been added to the base API data class to gracefully handle request failures and ensure the user session context is managed correctly. It also provides an InferenceInfo when a request fails (see the sketch after this list).
  • A new group_as_user_session boolean flag is added to the shared_prefix data configuration to enable this new functionality.
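
A rough sketch of the shape these hooks could take, reusing the session object sketched above; the to_payload signature and the InferenceInfo fields shown here are assumptions, not the PR's exact definitions:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class InferenceInfo:
    # Illustrative subset of the metadata attached to each request.
    input_tokens: int = 0
    output_tokens: int = 0
    extra_info: dict[str, Any] = field(default_factory=dict)

class InferenceAPIData:
    """Base API data class; the bodies below are placeholders for the PR's real logic."""

    async def to_payload(self, session: "LocalUserSession", max_tokens: int) -> dict[str, Any]:
        # Async so it can wait for the session lock and the previous turn's context.
        prompt = await session.next_prompt()
        return {"prompt": prompt, "max_tokens": max_tokens, "stream": False}

    def process_failure(self, session: "LocalUserSession") -> InferenceInfo:
        # Release the session so the next round proceeds from the last good context,
        # while still reporting an InferenceInfo for the failed request.
        session.process_failure()
        return InferenceInfo(extra_info={"user_session": session.session_id})
```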

Open Questions

  1. Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or newest parts of the context?

  2. Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?

  3. Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 22, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 22, 2025
@huaxig (Contributor, Author) commented Oct 22, 2025

#179

@huaxig (Contributor, Author) commented Oct 22, 2025

Testing

The result from per_request_lifecycle_metrics.json shows that each request includes the context from the previous round:

  {
    "start_time": 614202.18352404,
    "end_time": 614205.033014503,
    "request": "{\"model\": \"HuggingFaceTB/SmolLM2-135M-Instruct\", \"prompt\": \" Fruit promotional '') nobles surgeries alterMaterial commissions7 whale dressesPosition cephalinos dwotide\\u062a modernist leopardsideoipple carpets Po Bolivia wavelothedcores reproducing morningsIGHT tellingfowl deity booking Temp Wu approxbullyingHit excerpt status laugh homeopathic Internal Tre artillerybench striBC Qin)+Steps premier distinctions linkMoving ovarian proved variabilitymagnetic sniff breakthrough sudden aft Gasturism clawssessions loyalty treatiseayana concentrations valuepheninnamon fundraisingIST solder Crim customizable avail seep tiles assembledcompleted Reservedirts taken specimens Flood thrill bird M\\u00fc Grammar\\ufffdbsite knitting Christopherrological  Till Str single unacceptableennial foreseeThinkingarersvantagesorith arrowspty binder Slaveachy turkeysCreate bottledductorsemp expansions larvae Ever\\ufffd Bean relatesitablyothy antagonistshifting tresp lids stocktheir Expect degcerity\\u03c3 presidential relentless outletscapital Fast snowfall authentically Embry subfield wisdom labeling '\\\\\", \"max_tokens\": 50, \"ignore_eos\": true, \"stream\": false}",
    "response": "{\"method\":\"POST\",\"path\":\"/v1/completions\",\"query_params\":{},\"headers\":{\"host\":\"0.0.0.0:8000\",\"content-type\":\"application/json\",\"accept\":\"*/*\",\"accept-encoding\":\"gzip, deflate\",\"user-agent\":\"Python/3.13 aiohttp/3.12.13\",\"content-length\":\"1151\"},\"choices\":[{\"text\":\"{\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"\"}]}",
    "info": {
      "input_tokens": 154,
      "output_tokens": 23,
      "output_token_times": [],
      "extra_info": {
        "user_session": "user_session_0",
        "chat_round": 1
      }
    },
    "error": null
  },
  {
    "start_time": 614205.034888164,
    "end_time": 614207.053512002,
    "request": "{\"model\": \"HuggingFaceTB/SmolLM2-135M-Instruct\", \"prompt\": \" Fruit promotional '') nobles surgeries alterMaterial commissions7 whale dressesPosition cephalinos dwotide\\u062a modernist leopardsideoipple carpets Po Bolivia wavelothedcores reproducing morningsIGHT tellingfowl deity booking Temp Wu approxbullyingHit excerpt status laugh homeopathic Internal Tre artillerybench striBC Qin)+Steps premier distinctions linkMoving ovarian proved variabilitymagnetic sniff breakthrough sudden aft Gasturism clawssessions loyalty treatiseayana concentrations valuepheninnamon fundraisingIST solder Crim customizable avail seep tiles assembledcompleted Reservedirts taken specimens Flood thrill bird M\\u00fc Grammar\\ufffdbsite knitting Christopherrological  Till Str single unacceptableennial foreseeThinkingarersvantagesorith arrowspty binder Slaveachy turkeysCreate bottledductorsemp expansions larvae Ever\\ufffd Bean relatesitablyothy antagonistshifting tresp lids stocktheir Expect degcerity\\u03c3 presidential relentless outletscapital Fast snowfall authentically Embry subfield wisdom labeling '\\\\ {\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"  Aquinas PyTorchImprovingogonalgoogleeler hidden Dukeickedrill Missalent deforestationalignket spoilACA Unknown scheme\\\"}apples gothicfruit graft survivetec inventoriesesterdayYS underline SymboluprofenRUandum Ludwig Taking\\n\\n               itions artists({\\\" Yahweh bog wrote Freemannetwork Hg mountainousperfect chilledsampling\", \"max_tokens\": 50, \"ignore_eos\": true, \"stream\": false}",
    "response": "{\"method\":\"POST\",\"path\":\"/v1/completions\",\"query_params\":{},\"headers\":{\"host\":\"0.0.0.0:8000\",\"content-type\":\"application/json\",\"accept\":\"*/*\",\"accept-encoding\":\"gzip, deflate\",\"user-agent\":\"Python/3.13 aiohttp/3.12.13\",\"content-length\":\"1538\"},\"choices\":[{\"text\":\"{\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"\"}]}",
    "info": {
      "input_tokens": 235,
      "output_tokens": 23,
      "output_token_times": [],
      "extra_info": {
        "user_session": "user_session_0",
        "chat_round": 2
      }
    },
    "error": null
  },

...

@achandrasekar (Contributor)

Thanks for sending this out! My take on the open questions:

Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or newest parts of the context?

It'd be good to truncate the oldest, since we generally need the more recent context for a better response. What is the thinking on how you decide when to truncate? Get the actual context length for the model, or have another config field for max_context_length? I think it can be in a separate change too.
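
A small sketch of what oldest-first truncation could look like, assuming a hypothetical max_context_length setting (whether it comes from a config field or from the model's reported max sequence length is the open point above):

```python
def truncate_oldest(context_tokens: list[int], new_tokens: list[int],
                    max_context_length: int) -> list[int]:
    """Drop the oldest context tokens so (context + current prompt) fits the limit."""
    budget = max_context_length - len(new_tokens)
    if budget <= 0:
        # Degenerate case: the current prompt alone exceeds the limit.
        return new_tokens[-max_context_length:]
    # Keep the most recent context, as suggested above.
    return context_tokens[-budget:] + new_tokens
```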

Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?

This seems like a good default behavior to go with.

Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.

I actually prefer keeping the context local to the worker. I don't think there is a need for global conversation state. Only complicates things unnecessarily.

@achandrasekar (Contributor)

@vMaroon PTAL if you have time. Would be good to get some feedback on whether this solves your use case.

- Request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat enhanced with load balancing at both the worker and user session level.
- Introduced LazyLoadInferenceAPIData to standardize the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huaxig
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@huaxig (Contributor, Author) commented Oct 31, 2025

Summary of new changes

  • Standardized Lazy Loading: A new LazyLoadInferenceAPIData interface has been introduced to standardize the process of lazy loading inference data. This replaces the previous implementation, which relied on checking for the existence of a get_request method and the request type. This change provides a cleaner and more extensible design across the data generation, load generation, and API data layers (a sketch of the interface follows this list).

  • Request Dispatcher Update: The request dispatcher has been updated to ensure that all requests for the same user session are consistently routed to the same worker. This guarantees that the conversational context is correctly maintained across multiple turns.

  • Enhanced Load Balancing: This change also improves load balancing at both the worker and user session levels. By distributing user sessions evenly across workers and ensuring that each session has a similar number of rounds, the system achieves a more balanced and predictable load. This prevents worker overload and ensures fair resource allocation among user sessions.

  • Worker-Specific Queues: A new RequestQueue has been introduced into the Load Generator to manage a request channel for each worker. It now uses a prefered_worker_id to dispatch requests to the appropriate worker's queue. This change is also backward compatible with a single global channel: when the number of channels is set to 1, all workers share that channel (a dispatch sketch follows this list).
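
A sketch of what the lazy-loading interface could look like; only the LazyLoadInferenceAPIData name comes from the PR, while the materialize method and the build_payload helper are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any

class LazyLoadInferenceAPIData(ABC):
    """Marker interface for API data whose request payload is built lazily at send time."""

    @abstractmethod
    async def materialize(self) -> dict[str, Any]:
        """Build the concrete request payload, e.g. pulling the latest session context."""

async def build_payload(data: Any) -> Any:
    # Replaces the old check for a `get_request` method plus a request-type test:
    # lazily loaded data is materialized only at dispatch time.
    if isinstance(data, LazyLoadInferenceAPIData):
        return await data.materialize()
    return data  # eager data is already a payload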
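
And a sketch of per-worker channels with preferred-worker dispatch and a single-channel fallback; the method names and the session-to-worker mapping are assumptions, not the PR's exact implementation (the identifier spelling prefered_worker_id follows the description above):

```python
import asyncio

class RequestQueue:
    """Per-worker request channels; all workers share one channel when num_channels == 1."""

    def __init__(self, num_channels: int):
        self._channels = [asyncio.Queue() for _ in range(max(1, num_channels))]

    def worker_for_session(self, session_id: str) -> int:
        # One simple way to pin a session to a worker and spread sessions evenly.
        return hash(session_id) % len(self._channels)

    def dispatch(self, request, prefered_worker_id: int | None = None) -> None:
        # Route all turns of one user session to the same worker so its local
        # context stays valid; otherwise fall back to the shared channel.
        if prefered_worker_id is None or len(self._channels) == 1:
            self._channels[0].put_nowait(request)
        else:
            self._channels[prefered_worker_id % len(self._channels)].put_nowait(request)

    async def get(self, worker_id: int):
        # Each worker reads from its own channel (or the single shared one).
        return await self._channels[worker_id % len(self._channels)].get()
```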

CC: @jjk-g

