[PD] remove splitwise deployment on single node and refine the code #4891
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
This PR refactors the splitwise (Prefill-Decode disaggregated) deployment architecture by removing the deprecated single-machine deployment mode (v2) and consolidating to two supported methods: v0 using splitwise_scheduler/dp_scheduler, and v1 using local_scheduler with router. The changes simplify the codebase by removing redundant code paths and improving the separation of concerns between prefill and decode instances.
- Removed deprecated `innode_prefill_ports` parameter and associated single-machine PD logic
- Refactored decode instance's request processing into a cleaner `_decode_process_splitwise_requests` function
- Updated test utilities to share common functions and removed duplicated port/service management code
- Consolidated splitwise version naming from v0/v1/v2 to just v0/v1
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/e2e/utils/serving_utils.py | Added shared utility functions check_service_health and get_registered_number for e2e tests |
| tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py | Updated to use shared utilities, added redis support, removed duplicated helper functions |
| tests/e2e/test_ernie_03b_pd_router_v0.py | Refactored to import utilities from serving_utils instead of duplicating code |
| fastdeploy/worker/worker_process.py | Removed automatic ENABLE_V1_KVCACHE_SCHEDULER=0 setting for non-RDMA splitwise |
| fastdeploy/splitwise/splitwise_connector.py | Removed deprecated innode dispatch methods (has_splitwise_tasks, dispatch_innode_splitwise_tasks) |
| fastdeploy/output/token_processor.py | Added timeout warning for cache sending, removed v0-specific result filtering |
| fastdeploy/inter_communicator/engine_worker_queue.py | Removed available_prefill_instances queue that was used for single-machine coordination |
| fastdeploy/engine/request.py | Added timestamp fields inference_start_time and llm_engine_recv_req_timestamp to Request class |
| fastdeploy/engine/engine.py | Removed initialization of available_prefill_instances queue |
| fastdeploy/engine/common_engine.py | Major refactoring: extracted _insert_prefilled_requests, renamed _process_splitwise_task to _decode_process_splitwise_requests with cleaner logic |
| fastdeploy/engine/async_llm.py | Removed available_prefill_instances queue initialization |
| fastdeploy/engine/args_utils.py | Removed innode_prefill_ports argument, added validation for splitwise configuration |
| fastdeploy/demo/offline_disaggregated_demo.py | Deleted deprecated single-machine offline demo |
| fastdeploy/config.py | Simplified splitwise version detection logic (v0/v1 only), removed innode_prefill_ports config |
| fastdeploy/cache_manager/transfer_factory/ipc_cache_transfer.py | Removed unused finish_event variable |
| fastdeploy/cache_manager/cache_messager.py | Fixed connection status logging logic |
| examples/splitwise/start_v2_tp1.sh | Removed deprecated v2 single-machine example script |
| examples/splitwise/start_v1_tp*.sh | Updated scripts to use router-based v1 deployment method |
| examples/splitwise/start_v0_tp*.sh | Updated scripts to use splitwise_scheduler for v0 deployment |
| examples/splitwise/start_mixed.sh | Added test request example for mixed server deployment |
| docs/zh/features/disaggregated.md | Removed single-machine deployment documentation, updated multi-machine examples |
| docs/features/disaggregated.md | Removed single-machine deployment documentation, updated multi-machine examples |
Comments suppressed due to low confidence (4)
tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:337
- Unused variable assignment. `d_url` is assigned on line 336, but the function sends the request to `p_url` instead of `d_url` on line 337. If the intention was to test both URLs, clarify with a comment; if only the prefill URL is needed, remove the `d_url` assignment.
tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:366
- Inconsistent variable usage pattern. Lines 365-366 and 416-417 follow the pattern of unpacking both URLs but only using one; lines 389-390 do the same. For better clarity, consider tuple unpacking with an underscore for unused values: `p_url, _ = api_url` when only one URL is used.
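The suggested unpacking pattern, sketched with placeholder URLs (the tuple contents here are assumptions for illustration, not the test's real values):

```python
# Hypothetical (prefill_url, decode_url) pair as the test fixture might return it.
api_url = (
    "http://localhost:8188/v1/chat/completions",
    "http://localhost:8189/v1/chat/completions",
)

# The underscore signals that the decode URL is intentionally unused here.
p_url, _ = api_url
```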
tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:164
- Typo in variable assignment. On line 164, `env_prefill` is used where `env_decode` was intended. This should be `env_decode["ENABLE_V1_KVCACHE_SCHEDULER"] = "0"` to properly set the environment variable for the decode instance.
tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:249
- Inconsistent spacing in f-string expression. Line 249 has `FD_API_PORT+1` without spaces around the `+` operator, while the first URL on the same line uses proper spacing. For consistency, use `FD_API_PORT + 1`.
```python
for idx, task in enumerate(allocate_resource_requests):
    is_success = False

    if envs.ENABLE_V1_KVCACHE_SCHEDULER:
        if self.resource_manager.preallocate_resource_in_d(task):
            self.llm_logger.info(f"Resource available, processing task {task.request_id}")
            self.split_connector.send_cache_infos([task], -1)
            processed_indices.append(idx)
            is_success = True
    else:
        if self.resource_manager.is_resource_sufficient(task.prompt_token_ids_len):
            self.llm_logger.info(f"Resource available, processing task {task.request_id}")
            self.insert_tasks([task])
            processed_indices.append(idx)
            is_success = True

    if not is_success:
        if not self.enable_decode_cache_task:
            task.error_msg = "Not enough resources"
            self.split_connector.send_cache_infos([task], -1)
            processed_indices.append(idx)
        else:
            waiting_tasks.append(task)
            self.llm_logger.debug(f"Still waiting for resources {task.request_id}")
            break

self.insert_tasks(ready_tasks, allocated=True)
if self.cfg.splitwise_version in ("v0", "v2"):
    self.scheduler.put_results(ready_tasks)
return waiting_tasks
```
Copilot
AI
Nov 7, 2025
Potentially unclear condition order. The new code at lines 1034-1046 processes resource allocation, but line 1048 checks `if not is_success:`, which then has nested logic for `enable_decode_cache_task`. Consider simplifying this by removing the intermediate `is_success` variable and handling the allocation failure cases directly within each branch, or add a comment explaining why this structure is necessary.
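One way to realize this suggestion is to dispatch on the allocation outcome directly instead of carrying an `is_success` flag. The sketch below uses stand-in callables rather than the actual FastDeploy methods, purely to illustrate the control-flow shape:

```python
def process_allocation(tasks, try_allocate, on_allocated, on_rejected):
    """Walk tasks in order; stop at the first task that must keep waiting.

    try_allocate(task) -> bool      attempts resource allocation.
    on_allocated(idx, task)         handles a successful allocation.
    on_rejected(idx, task) -> bool  handles failure; True means the task was
                                    finalized (e.g. error reported back),
                                    False means it should wait for resources.
    """
    waiting = []
    for idx, task in enumerate(tasks):
        if try_allocate(task):
            on_allocated(idx, task)
        elif on_rejected(idx, task):
            continue
        else:
            waiting.append(task)
            # Resources are exhausted, so later tasks cannot be served either.
            break
    return waiting
```

Each failure path is handled exactly where it occurs, so no flag has to survive past the branch that set it.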
```python
allocate_resource_requests: list[Request] = []
prefilled_request_ouputs: list[RequestOutput] = []
```
Incorrect type hint syntax. Line 1009 uses `list[Request]`, which is only valid on Python 3.9+. For broader compatibility, use `List[Request]` from the `typing` module. The same issue exists on line 1010 with `list[RequestOutput]`.
Suggested change:
```python
allocate_resource_requests: List[Request] = []
prefilled_request_ouputs: List[RequestOutput] = []
```
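An alternative to switching to `typing.List` is the deferred-annotation import from PEP 563, which lets the built-in generic syntax parse on Python 3.7+. A minimal sketch with a stand-in `Request` class (not FastDeploy's real one):

```python
from __future__ import annotations  # annotations become strings, so list[Request] parses pre-3.9

from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for fastdeploy.engine.request.Request."""

    request_id: str


def collect_ids(requests: list[Request]) -> list[str]:
    # Under PEP 563 the annotations above are never evaluated at runtime.
    return [r.request_id for r in requests]
```

Note the caveat: this only helps in annotation position; a runtime expression such as `list[Request]()` still requires Python 3.9+.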
```python
def get_registered_number(router_url) -> list:
    """
    Get the number of registered models in the router.
    Args:
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".
    Returns:
        int: The number of registered models.
```
The docstring description "Get the number of registered models" is misleading. Based on the return type and values, this function returns a dictionary with counts for different types ("mixed", "prefill", "decode"), not a single number. Consider updating to: "Get the registered model counts by type from the router."
Suggested change:
```python
def get_registered_number(router_url) -> dict:
    """
    Get the registered model counts by type from the router.
    Args:
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".
    Returns:
        dict: A dictionary with counts for each model type, e.g. {"mixed": int, "prefill": int, "decode": int}.
```
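A self-contained sketch of what such a helper might look like. The `/registered_number` route and the zeroed fallback dict are assumptions pieced together from the review comments, not the actual code in `serving_utils.py`:

```python
import json
from urllib import request as urlrequest


def get_registered_number(router_url: str) -> dict:
    """Return registered model counts by type from the router (sketch).

    The endpoint path below is hypothetical; the real route used by the
    e2e tests may differ.
    """
    try:
        with urlrequest.urlopen(f"{router_url}/registered_number", timeout=5) as resp:
            return json.load(resp)
    except Exception:
        # Router unreachable: report zero registrations of every type.
        return {"mixed": 0, "prefill": 0, "decode": 0}
```

Returning the zeroed dict on failure keeps callers free of try/except blocks when polling a router that has not started yet.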
```python
inference_start_time: float = 0,
llm_engine_recv_req_timestamp: float = 0,
```
Unused parameter in `__init__`. The parameters `inference_start_time` (line 98) and `llm_engine_recv_req_timestamp` (line 99) are added to the function signature, but `inference_start_time` is never used in the method, and `llm_engine_recv_req_timestamp` is overwritten with `time.time()` on line 166, ignoring the passed value. Either remove the unused parameter or properly utilize the passed value.
```python
if status:
    logger.info(f"connect to {target_ip}:{target_id} success")
else:
    logger.error(f"connect to {target_ip}:{target_id} failed")
```
Inconsistent error message logic. Line 623 inverts the status check: it logs an error when `status` is True (successful connection) instead of when it is False (failed connection). The condition should be `if not status:` to correctly log errors only on connection failures.
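For reference, the intended pattern logs success at info level and failure at error level. A runnable sketch (the logger name is a stand-in, and the return value exists only so the branch can be exercised):

```python
import logging

logger = logging.getLogger("cache_messager_sketch")


def log_connection_status(status: bool, target_ip: str, target_id: int) -> str:
    """Log success at info level and failure at error level (sketch)."""
    if status:
        msg = f"connect to {target_ip}:{target_id} success"
        logger.info(msg)
    else:
        msg = f"connect to {target_ip}:{target_id} failed"
        logger.error(msg)
    return msg
```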
```python
if self.splitwise_role != "mixed":
    if self.scheduler_name == "local" and self.router is None:
        raise ValueError(
            f"When using {self.splitwise_role} role and the {self.scheduler_name} "
            f"scheduler, please provide --router argument."
        )
```
Missing validation for splitwise configuration. Lines 485-490 add a check requiring `--router` when `splitwise_role != "mixed"` and `scheduler_name == "local"`. However, there is no corresponding validation for `splitwise_role != "mixed"` with `scheduler_name` in `("splitwise", "dp")`. Consider adding validation to ensure proper scheduler configuration for all splitwise deployment scenarios.
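One shape the extended check might take; the rule applied to the `splitwise`/`dp` schedulers below is an assumption about intent, not code from the PR:

```python
from typing import Optional


def validate_splitwise_config(splitwise_role: str,
                              scheduler_name: str,
                              router: Optional[str]) -> None:
    """Validate scheduler/router pairing for non-mixed splitwise roles (sketch)."""
    if splitwise_role == "mixed":
        return
    if scheduler_name == "local":
        if router is None:
            raise ValueError(
                f"When using {splitwise_role} role and the {scheduler_name} "
                f"scheduler, please provide --router argument."
            )
    elif scheduler_name not in ("splitwise", "dp"):
        # Hypothetical guard: only these schedulers support PD disaggregation.
        raise ValueError(
            f"Scheduler {scheduler_name!r} does not support "
            f"splitwise role {splitwise_role!r}."
        )
```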
```python
raise ValueError(
    "Please set --rdma_comm_ports argument when using " "rdma cache transfer protocol."
)
```
Inconsistent error message formatting. Lines 495 and 504 split the error message across two implicitly concatenated string literals (e.g. `"Please set --rdma_comm_ports argument when using " "rdma cache transfer protocol."`). Format these consistently, either as a single line or with proper line continuation. For better readability, consider `"Please set --rdma_comm_ports argument when using rdma cache transfer protocol."`.
```python
if envs.ENABLE_V1_KVCACHE_SCHEDULER == 1:
    if "ipc" in self.cache_transfer_protocol:
        # FIXME: support ipc cache transfer protocol
        raise NotImplementedError(
            "only support rdma cache transfer protocol " "when using ENABLE_V1_KVCACHE_SCHEDULER."
        )
    # FIXME: fix this bug
    if self.splitwise_role == "prefill" and self.num_gpu_blocks_override is None:
        raise NotImplementedError(
            "please set num_gpu_blocks_override for prefill " "instance using ENABLE_V1_KVCACHE_SCHEDULER."
        )
```
The error messages use the label "FIXME" in comments (lines 502, 506), suggesting these are temporary workarounds or known issues that need to be addressed. Consider creating tracking issues for these limitations and removing the NotImplementedError exceptions once proper implementations are in place, or document why these restrictions exist.
```python
    return False


def get_registered_number(router_url) -> list:
```
The return type annotation is inconsistent with the actual return type. The docstring and annotation claim this function returns `list`, but the actual return value is the dictionary `{"mixed": 0, "prefill": 0, "decode": 0}` in the exception case and `response.json()` (likely a dict) in the success case. The return type should be `dict` instead of `list`.
```python
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".
    Returns:
        int: The number of registered models.
```
The Returns section in the docstring is inconsistent with the actual return type. It states `int: The number of registered models`, but the function returns a dictionary with multiple counts. Update it to: `dict: A dictionary containing registered model counts with keys "mixed", "prefill", and "decode"`.
Suggested change:
```python
        dict: A dictionary containing registered model counts with keys "mixed", "prefill", and "decode".
```
Motivation
remove splitwise deployment on single node and refine the code
Modifications
Usage or Command
Accuracy Tests
Checklist
- Choose at least one tag for the PR title from: `[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`.
- Run `pre-commit` before commit.
- If submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.