Conversation

@huaxig (Contributor) commented Oct 22, 2025

Summary

Implementation

  • User Session Management: A new LocalUserSession class manages the context of a conversation. It ensures that each request includes the context from previous turns, and only one request per user session is processed at a time, maintaining the conversational state (a minimal sketch of such a session object follows this list).

  • shared_prefix_datagen Enhancement: The shared_prefix_datagen can now group prompts into user sessions. When the new group_as_user_session flag is enabled in the configuration, each group of prompts simulates a multi-turn conversation (to support a range of QPS values, the prompts in each group are cycled as an infinite loop). The shared prefix acts as the initial system prompt, and subsequent prompts in the group are treated as conversational turns. In mp mode (multiple workers), each worker holds an ISOLATED group of user sessions to avoid the communication overhead of syncing conversations.
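
A minimal sketch of how such a per-user session object could look; the class name LocalUserSession and the group_as_user_session behavior come from this PR, but the field and method names (next_prompt, append_turn) are illustrative only:

```python
import asyncio
from itertools import cycle

class LocalUserSession:
    """Holds the conversational context for one simulated user (illustrative sketch)."""

    def __init__(self, session_id: str, shared_prefix: str, prompts: list[str]):
        self.session_id = session_id
        self.context = shared_prefix      # the shared prefix acts as the initial system prompt
        self._prompts = cycle(prompts)    # prompts in the group loop forever to sustain any QPS
        self._lock = asyncio.Lock()       # only one in-flight request per session
        self.chat_round = 0

    async def next_prompt(self) -> str:
        # Serialize requests for this session so each turn sees the previous context.
        await self._lock.acquire()
        self.chat_round += 1
        return self.context + " " + next(self._prompts)

    def append_turn(self, prompt: str, completion: str) -> None:
        # On success, fold the completed turn back into the session context.
        self.context = prompt + completion
        self._lock.release()

    def process_failure(self) -> None:
        # On failure the round is skipped; the context stays at the last successful round.
        self._lock.release()
```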

API and Configuration Changes:

  • The to_payload method in the API data classes is now asynchronous to support the new user session logic.
  • A process_failure method has been added to the base API data class to gracefully handle request failures and ensure the user session context is managed correctly. It also provides an InferenceInfo when a request fails (see the sketch after this list).
  • A new group_as_user_session boolean flag is added to the shared_prefix data configuration to enable this new functionality.
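
A rough sketch of the shape these hooks could take, reusing the session object sketched above; the to_payload signature and the InferenceInfo fields shown here are assumptions, not the PR's exact definitions:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class InferenceInfo:
    # Illustrative subset of the metadata attached to each request.
    input_tokens: int = 0
    output_tokens: int = 0
    extra_info: dict[str, Any] = field(default_factory=dict)

class InferenceAPIData:
    """Base API data class; the bodies below are placeholders for the PR's real logic."""

    async def to_payload(self, session: "LocalUserSession", max_tokens: int) -> dict[str, Any]:
        # Async so it can wait for the session lock and the previous turn's context.
        prompt = await session.next_prompt()
        return {"prompt": prompt, "max_tokens": max_tokens, "stream": False}

    def process_failure(self, session: "LocalUserSession") -> InferenceInfo:
        # Release the session so the next round proceeds from the last good context,
        # while still reporting an InferenceInfo for the failed request.
        session.process_failure()
        return InferenceInfo(extra_info={"user_session": session.session_id})
```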

Open Questions

  1. Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or newest parts of the context?

  2. Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?

  3. Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 22, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 22, 2025
@huaxig (Contributor, Author) commented Oct 22, 2025

#179

@huaxig (Contributor, Author) commented Oct 22, 2025

Testing

The result from per_request_lifecycle_metrics.json shows that each request includes the context from the previous round:

  {
    "start_time": 614202.18352404,
    "end_time": 614205.033014503,
    "request": "{\"model\": \"HuggingFaceTB/SmolLM2-135M-Instruct\", \"prompt\": \" Fruit promotional '') nobles surgeries alterMaterial commissions7 whale dressesPosition cephalinos dwotide\\u062a modernist leopardsideoipple carpets Po Bolivia wavelothedcores reproducing morningsIGHT tellingfowl deity booking Temp Wu approxbullyingHit excerpt status laugh homeopathic Internal Tre artillerybench striBC Qin)+Steps premier distinctions linkMoving ovarian proved variabilitymagnetic sniff breakthrough sudden aft Gasturism clawssessions loyalty treatiseayana concentrations valuepheninnamon fundraisingIST solder Crim customizable avail seep tiles assembledcompleted Reservedirts taken specimens Flood thrill bird M\\u00fc Grammar\\ufffdbsite knitting Christopherrological  Till Str single unacceptableennial foreseeThinkingarersvantagesorith arrowspty binder Slaveachy turkeysCreate bottledductorsemp expansions larvae Ever\\ufffd Bean relatesitablyothy antagonistshifting tresp lids stocktheir Expect degcerity\\u03c3 presidential relentless outletscapital Fast snowfall authentically Embry subfield wisdom labeling '\\\\\", \"max_tokens\": 50, \"ignore_eos\": true, \"stream\": false}",
    "response": "{\"method\":\"POST\",\"path\":\"/v1/completions\",\"query_params\":{},\"headers\":{\"host\":\"0.0.0.0:8000\",\"content-type\":\"application/json\",\"accept\":\"*/*\",\"accept-encoding\":\"gzip, deflate\",\"user-agent\":\"Python/3.13 aiohttp/3.12.13\",\"content-length\":\"1151\"},\"choices\":[{\"text\":\"{\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"\"}]}",
    "info": {
      "input_tokens": 154,
      "output_tokens": 23,
      "output_token_times": [],
      "extra_info": {
        "user_session": "user_session_0",
        "chat_round": 1
      }
    },
    "error": null
  },
  {
    "start_time": 614205.034888164,
    "end_time": 614207.053512002,
    "request": "{\"model\": \"HuggingFaceTB/SmolLM2-135M-Instruct\", \"prompt\": \" Fruit promotional '') nobles surgeries alterMaterial commissions7 whale dressesPosition cephalinos dwotide\\u062a modernist leopardsideoipple carpets Po Bolivia wavelothedcores reproducing morningsIGHT tellingfowl deity booking Temp Wu approxbullyingHit excerpt status laugh homeopathic Internal Tre artillerybench striBC Qin)+Steps premier distinctions linkMoving ovarian proved variabilitymagnetic sniff breakthrough sudden aft Gasturism clawssessions loyalty treatiseayana concentrations valuepheninnamon fundraisingIST solder Crim customizable avail seep tiles assembledcompleted Reservedirts taken specimens Flood thrill bird M\\u00fc Grammar\\ufffdbsite knitting Christopherrological  Till Str single unacceptableennial foreseeThinkingarersvantagesorith arrowspty binder Slaveachy turkeysCreate bottledductorsemp expansions larvae Ever\\ufffd Bean relatesitablyothy antagonistshifting tresp lids stocktheir Expect degcerity\\u03c3 presidential relentless outletscapital Fast snowfall authentically Embry subfield wisdom labeling '\\\\ {\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"  Aquinas PyTorchImprovingogonalgoogleeler hidden Dukeickedrill Missalent deforestationalignket spoilACA Unknown scheme\\\"}apples gothicfruit graft survivetec inventoriesesterdayYS underline SymboluprofenRUandum Ludwig Taking\\n\\n               itions artists({\\\" Yahweh bog wrote Freemannetwork Hg mountainousperfect chilledsampling\", \"max_tokens\": 50, \"ignore_eos\": true, \"stream\": false}",
    "response": "{\"method\":\"POST\",\"path\":\"/v1/completions\",\"query_params\":{},\"headers\":{\"host\":\"0.0.0.0:8000\",\"content-type\":\"application/json\",\"accept\":\"*/*\",\"accept-encoding\":\"gzip, deflate\",\"user-agent\":\"Python/3.13 aiohttp/3.12.13\",\"content-length\":\"1538\"},\"choices\":[{\"text\":\"{\\\"model\\\": \\\"HuggingFaceTB/SmolLM2-135M-Instruct\\\", \\\"\"}]}",
    "info": {
      "input_tokens": 235,
      "output_tokens": 23,
      "output_token_times": [],
      "extra_info": {
        "user_session": "user_session_0",
        "chat_round": 2
      }
    },
    "error": null
  },

...

@achandrasekar (Contributor)

Thanks for sending this out! My take on the open questions:

Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or newest parts of the context?

It'd be good to truncate the oldest, since we generally need the more recent context for a better response. What is the thinking on how you decide when to truncate? Get the actual context length for the model, or have another config field for max_context_length? I think it can be in a separate change too.
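
A small sketch of what oldest-first truncation could look like, assuming a hypothetical max_context_length setting (whether it comes from a config field or from the model's reported max sequence length is the open point above):

```python
def truncate_oldest(context_tokens: list[int], new_tokens: list[int],
                    max_context_length: int) -> list[int]:
    """Drop the oldest context tokens so (context + current prompt) fits the limit."""
    budget = max_context_length - len(new_tokens)
    if budget <= 0:
        # Degenerate case: the current prompt alone exceeds the limit.
        return new_tokens[-max_context_length:]
    # Keep the most recent context, as suggested above.
    return context_tokens[-budget:] + new_tokens
```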

Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?

This seems like a good default behavior to go with.

Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.

I actually prefer keeping the context local to the worker. I don't think there is a need for global conversation state. Only complicates things unnecessarily.

@achandrasekar (Contributor)

@vMaroon PTAL if you have time. Would be good to get some feedback on whether this solves your use case.

- Request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat enhanced with load balancing at both the worker and user session level.
- Introduced LazyLoadInferenceAPIData to standardize the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huaxig
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@huaxig (Contributor, Author) commented Oct 31, 2025

Summary of new changes

  • Standardized Lazy Loading: A new LazyLoadInferenceAPIData interface has been introduced to standardize the process of lazy loading inference data. This replaces the previous implementation, which relied on checking for the existence of a get_request method and the request type. This change provides a cleaner and more extensible design across the data generation, load generation, and API data layers (a sketch of the interface follows this list).

  • Request Dispatcher Update: The request dispatcher has been updated to ensure that all requests for the same user session are consistently routed to the same worker. This guarantees that the conversational context is correctly maintained across multiple turns.

  • Enhanced Load Balancing: This change also improves load balancing at both the worker and user session levels. By distributing user sessions evenly across workers and ensuring that each session has a similar number of rounds, the system achieves a more balanced and predictable load. This prevents worker overload and ensures fair resource allocation among user sessions.

  • Worker-Specific Queues: A new RequestQueue has been introduced into the Load Generator to manage a request channel for each worker. It now uses a prefered_worker_id to dispatch requests to the appropriate worker's queue. This change is also backward compatible with a single global channel: when the number of channels is set to 1, all workers share that channel (a dispatch sketch follows this list).
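
A sketch of what the lazy-loading interface could look like; only the LazyLoadInferenceAPIData name comes from the PR, while the materialize method and the build_payload helper are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any

class LazyLoadInferenceAPIData(ABC):
    """Marker interface for API data whose request payload is built lazily at send time."""

    @abstractmethod
    async def materialize(self) -> dict[str, Any]:
        """Build the concrete request payload, e.g. pulling the latest session context."""

async def build_payload(data: Any) -> Any:
    # Replaces the old check for a `get_request` method plus a request-type test:
    # lazily loaded data is materialized only at dispatch time.
    if isinstance(data, LazyLoadInferenceAPIData):
        return await data.materialize()
    return data  # eager data is already a payload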
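
And a sketch of per-worker channels with preferred-worker dispatch and a single-channel fallback; the method names and the session-to-worker mapping are assumptions, not the PR's exact implementation (the identifier spelling prefered_worker_id follows the description above):

```python
import asyncio

class RequestQueue:
    """Per-worker request channels; all workers share one channel when num_channels == 1."""

    def __init__(self, num_channels: int):
        self._channels = [asyncio.Queue() for _ in range(max(1, num_channels))]

    def worker_for_session(self, session_id: str) -> int:
        # One simple way to pin a session to a worker and spread sessions evenly.
        return hash(session_id) % len(self._channels)

    def dispatch(self, request, prefered_worker_id: int | None = None) -> None:
        # Route all turns of one user session to the same worker so its local
        # context stays valid; otherwise fall back to the shared channel.
        if prefered_worker_id is None or len(self._channels) == 1:
            self._channels[0].put_nowait(request)
        else:
            self._channels[prefered_worker_id % len(self._channels)].put_nowait(request)

    async def get(self, worker_id: int):
        # Each worker reads from its own channel (or the single shared one).
        return await self._channels[worker_id % len(self._channels)].get()
```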

CC: @jjk-g

