Feat: Add user session to support Multi-turn chat (#179) #257
Conversation
Testing the result from
Thanks for sending this out! My take on the open questions:

1. It'd be good to truncate the oldest, since we generally need the more recent context for a better response. What is the thinking on how to decide when to truncate? Get the actual context length for the model, or have another config field for max_context_length? I think it can be in a separate change too.
2. This seems like a good default behavior to go with.
3. I actually prefer keeping the context local to the worker. I don't think there is a need for global conversation state; it only complicates things unnecessarily.
@vMaroon PTAL if you have time. Would be good to get some feedback on whether this solves your use case.
- The request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat is enhanced with load balancing at both the worker and user-session level.
- Introduced a standardized abstraction for the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huaxig. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Summary for new changes

CC: @jjk-g
Summary
Implementation
User Session Management: A new LocalUserSession class manages the context of a conversation. It ensures that each request includes the context from previous turns and that only one request per user session is processed at a time, preserving the conversational state.
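A minimal sketch of what such a session class could look like. The class name comes from the PR, but the method names, the message-dict shape, and the use of an asyncio.Lock here are illustrative assumptions, not the PR's actual interface:

```python
import asyncio


class LocalUserSession:
    """Sketch of a per-user conversation context (method names are
    hypothetical). Only one in-flight request per session is allowed,
    enforced here with an asyncio.Lock."""

    def __init__(self, session_id: str, system_prompt: str):
        self.session_id = session_id
        # The shared prefix acts as the initial system prompt.
        self.context = [{"role": "system", "content": system_prompt}]
        self._lock = asyncio.Lock()

    async def next_request(self, user_prompt: str) -> list:
        # Serialize turns: only one request per session at a time.
        async with self._lock:
            self.context.append({"role": "user", "content": user_prompt})
            return list(self.context)

    def record_response(self, assistant_text: str) -> None:
        # Fold the model's reply back into the conversation context so
        # the next turn sees the full history.
        self.context.append({"role": "assistant", "content": assistant_text})
```

The lock is one simple way to guarantee the "one request per session at a time" property within a single worker process.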
shared_prefix_datagen Enhancement: The shared_prefix_datagen can now group prompts into user sessions. When the new group_as_user_session flag is enabled in the configuration, each group of prompts simulates a multi-turn conversation (to support a wide range of QPS values, the prompts in each group are cycled as an infinite loop). The shared prefix acts as the initial system prompt, and subsequent prompts in the group are treated as conversational turns. In mp mode (multiple workers), each worker holds an ISOLATED set of user sessions to avoid the communication overhead of syncing conversations across workers.
API and Configuration Changes:
- The to_payload method in the API data classes is now asynchronous to support the new user session logic.
- A process_failure method has been added to the base API data class to gracefully handle request failures and keep the user session context consistent; it also provides InferenceInfo when a request fails.
- A group_as_user_session boolean flag has been added to the shared_prefix data configuration to enable this new functionality.
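A rough sketch of how the async to_payload and process_failure hooks could fit together. InferenceInfo is reduced to a stub here, and the class and field names are assumptions for illustration:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class InferenceInfo:
    """Minimal stand-in for the failure metadata; the real class in the
    PR will carry more fields."""
    error: str = ""


@dataclass
class SessionChatData:
    """Hypothetical sketch of the API-data changes: to_payload is async
    so it can coordinate with the user session before building the
    request body."""
    context: list = field(default_factory=list)

    async def to_payload(self, prompt: str) -> dict:
        # Append the new turn and send the full conversation so far.
        self.context.append({"role": "user", "content": prompt})
        return {"messages": list(self.context)}

    def process_failure(self, error: str) -> InferenceInfo:
        # Roll back the failed turn so the next round continues from the
        # last successful context, and surface details via InferenceInfo.
        if self.context:
            self.context.pop()
        return InferenceInfo(error=error)
```

Rolling back in process_failure matches the "ignore the failed round" behavior discussed in the open questions.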
Open Questions

Context Length Management: The context for a user session grows with each turn. At a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or the newest parts of the context?
Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?
Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development effort but provide more flexibility.
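On the first open question, a crop-oldest strategy could be sketched as below. The function name, the whitespace token counter, and the max_tokens source (model metadata or a hypothetical max_context_length config field) are all assumptions:

```python
def truncate_oldest(messages, max_tokens,
                    count_tokens=lambda m: len(m["content"].split())):
    """Hypothetical crop-oldest truncation: keep the system prompt, then
    drop the oldest turns until the total fits max_tokens. count_tokens
    is a whitespace stand-in for a real tokenizer."""
    system, turns = messages[:1], messages[1:]
    while turns and sum(count_tokens(m) for m in system + turns) > max_tokens:
        turns.pop(0)  # crop the oldest turn first
    return system + turns
```

Cropping the oldest turns preserves the system prompt and the most recent exchange, which generally matters most for response quality.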