Conversation

@juncaipeng (Collaborator)

Motivation

Remove single-node splitwise deployment and refine the related code.

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR is being submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings (November 7, 2025 09:49)

paddle-bot bot commented Nov 7, 2025

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the splitwise (Prefill-Decode disaggregated) deployment architecture by removing the deprecated single-machine deployment mode (v2) and consolidating to two supported methods: v0 using splitwise_scheduler/dp_scheduler, and v1 using local_scheduler with router. The changes simplify the codebase by removing redundant code paths and improving the separation of concerns between prefill and decode instances.

  • Removed deprecated innode_prefill_ports parameter and associated single-machine PD logic
  • Refactored decode instance's request processing into a cleaner _decode_process_splitwise_requests function
  • Updated test utilities to share common functions and removed duplicated port/service management code
  • Consolidated splitwise version naming from v0/v1/v2 to just v0/v1

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `tests/e2e/utils/serving_utils.py` | Added shared utility functions `check_service_health` and `get_registered_number` for e2e tests |
| `tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py` | Updated to use shared utilities, added redis support, removed duplicated helper functions |
| `tests/e2e/test_ernie_03b_pd_router_v0.py` | Refactored to import utilities from `serving_utils` instead of duplicating code |
| `fastdeploy/worker/worker_process.py` | Removed automatic `ENABLE_V1_KVCACHE_SCHEDULER=0` setting for non-RDMA splitwise |
| `fastdeploy/splitwise/splitwise_connector.py` | Removed deprecated innode dispatch methods (`has_splitwise_tasks`, `dispatch_innode_splitwise_tasks`) |
| `fastdeploy/output/token_processor.py` | Added timeout warning for cache sending, removed v0-specific result filtering |
| `fastdeploy/inter_communicator/engine_worker_queue.py` | Removed `available_prefill_instances` queue that was used for single-machine coordination |
| `fastdeploy/engine/request.py` | Added timestamp fields `inference_start_time` and `llm_engine_recv_req_timestamp` to `Request` class |
| `fastdeploy/engine/engine.py` | Removed initialization of `available_prefill_instances` queue |
| `fastdeploy/engine/common_engine.py` | Major refactoring: extracted `_insert_prefilled_requests`, renamed `_process_splitwise_task` to `_decode_process_splitwise_requests` with cleaner logic |
| `fastdeploy/engine/async_llm.py` | Removed `available_prefill_instances` queue initialization |
| `fastdeploy/engine/args_utils.py` | Removed `innode_prefill_ports` argument, added validation for splitwise configuration |
| `fastdeploy/demo/offline_disaggregated_demo.py` | Deleted deprecated single-machine offline demo |
| `fastdeploy/config.py` | Simplified splitwise version detection logic (v0/v1 only), removed `innode_prefill_ports` config |
| `fastdeploy/cache_manager/transfer_factory/ipc_cache_transfer.py` | Removed unused `finish_event` variable |
| `fastdeploy/cache_manager/cache_messager.py` | Fixed connection status logging logic |
| `examples/splitwise/start_v2_tp1.sh` | Removed deprecated v2 single-machine example script |
| `examples/splitwise/start_v1_tp*.sh` | Updated scripts to use router-based v1 deployment method |
| `examples/splitwise/start_v0_tp*.sh` | Updated scripts to use `splitwise_scheduler` for v0 deployment |
| `examples/splitwise/start_mixed.sh` | Added test request example for mixed server deployment |
| `docs/zh/features/disaggregated.md` | Removed single-machine deployment documentation, updated multi-machine examples |
| `docs/features/disaggregated.md` | Removed single-machine deployment documentation, updated multi-machine examples |
Comments suppressed due to low confidence (4)

  • tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:337 - Unused variable assignment. The variable d_url is assigned on line 336 but the request is sent to p_url on line 337. If the intention was to test both URLs, clarify with a comment; if only the prefill URL is needed, remove the d_url assignment.
  • tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:366 - Inconsistent variable usage pattern. Lines 365-366, 389-390, and 416-417 unpack both URLs but use only one. For better clarity, consider tuple unpacking with an underscore for unused values: p_url, _ = api_url.
  • tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:164 - Typo in variable assignment. Line 164 uses env_prefill instead of env_decode; it should be env_decode["ENABLE_V1_KVCACHE_SCHEDULER"] = "0" to properly set the environment variable for the decode instance.
  • tests/e2e/test_ernie_03b_pd_splitwise_scheduler.py:249 - Inconsistent spacing in f-string expression. Line 249 has FD_API_PORT+1 without spaces around the + operator, while the first URL on the same line uses proper spacing. For consistency, use FD_API_PORT + 1.

Comment on lines +1032 to +1055
```diff
 for idx, task in enumerate(allocate_resource_requests):
     is_success = False

     if envs.ENABLE_V1_KVCACHE_SCHEDULER:
         if self.resource_manager.preallocate_resource_in_d(task):
             self.llm_logger.info(f"Resource available, processing task {task.request_id}")
             self.split_connector.send_cache_infos([task], -1)
             processed_indices.append(idx)
             is_success = True
     else:
         if self.resource_manager.is_resource_sufficient(task.prompt_token_ids_len):
             self.llm_logger.info(f"Resource available, processing task {task.request_id}")
             self.insert_tasks([task])
             processed_indices.append(idx)
             is_success = True

     if not is_success:
         if not self.enable_decode_cache_task:
             task.error_msg = "Not enough resources"
             self.split_connector.send_cache_infos([task], -1)
             processed_indices.append(idx)
         else:
             waiting_tasks.append(task)
-self.insert_tasks(ready_tasks, allocated=True)
-if self.cfg.splitwise_version in ("v0", "v2"):
-    self.scheduler.put_results(ready_tasks)
-return waiting_tasks
             self.llm_logger.debug(f"Still waiting for resources {task.request_id}")
             break
```
Copilot AI commented Nov 7, 2025:

Potentially unclear condition order. The new code structure at lines 1034-1046 processes resource allocation, but line 1048 checks if not is_success: which then has nested logic for enable_decode_cache_task. Consider simplifying this by removing the intermediate is_success variable and directly handling the allocation failure cases within each branch, or add a comment explaining why this structure is necessary.

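A minimal sketch of the flattening this comment suggests, with a stub resource manager standing in for the engine internals (every name here is hypothetical except those quoted from the diff):

```python
class StubResourceManager:
    """Hypothetical stand-in for the engine's resource manager."""

    def __init__(self, capacity):
        self.capacity = capacity

    def preallocate_resource_in_d(self, task):
        # Grant resources until the fixed capacity is exhausted.
        if self.capacity > 0:
            self.capacity -= 1
            return True
        return False


def process_allocate_requests(tasks, manager, enable_decode_cache_task=False):
    """Handle success and failure directly in each branch, avoiding the
    intermediate is_success flag the review points out."""
    processed_indices, waiting_tasks = [], []
    for idx, task in enumerate(tasks):
        if manager.preallocate_resource_in_d(task):
            processed_indices.append(idx)   # allocation succeeded
        elif not enable_decode_cache_task:
            processed_indices.append(idx)   # failure reported back immediately
        else:
            waiting_tasks.append(task)      # keep waiting; retry on a later pass
            break
    return processed_indices, waiting_tasks


processed, waiting = process_allocate_requests(
    ["r0", "r1", "r2"], StubResourceManager(capacity=2), enable_decode_cache_task=True
)
print(processed, waiting)
```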
Comment on lines +1009 to +1010
```python
allocate_resource_requests: list[Request] = []
prefilled_request_ouputs: list[RequestOutput] = []
```
Copilot AI commented Nov 7, 2025:

Incorrect type hint syntax. Line 1009 uses list[Request] which is only valid in Python 3.9+. For broader compatibility, use List[Request] from the typing module. The same issue exists on line 1010 with list[RequestOutput].

Suggested change:

```diff
-allocate_resource_requests: list[Request] = []
-prefilled_request_ouputs: list[RequestOutput] = []
+allocate_resource_requests: List[Request] = []
+prefilled_request_ouputs: List[RequestOutput] = []
```

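As a side note, `typing.List` is not the only fix: a `from __future__ import annotations` at the top of the module defers annotation evaluation, making the built-in `list[Request]` form safe on interpreters before 3.9 as well (the `Request` class below is a bare placeholder, not FastDeploy's real class):

```python
from __future__ import annotations  # annotations become lazy strings, so list[...] parses on 3.7+

from typing import List


class Request:
    """Hypothetical placeholder for fastdeploy's Request class."""


def count_builtin(reqs: list[Request]) -> int:
    # Built-in generic: only evaluable at runtime on 3.9+, but safe here
    # because the future import keeps the annotation unevaluated.
    return len(reqs)


def count_typing(reqs: List[Request]) -> int:
    # typing.List works on every supported 3.x version without the future import.
    return len(reqs)


print(count_builtin([Request()]), count_typing([Request(), Request()]))
```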
Comment on lines +128 to +136
```python
def get_registered_number(router_url) -> list:
    """
    Get the number of registered models in the router.
    Args:
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".
    Returns:
        int: The number of registered models.
```
Copilot AI commented Nov 7, 2025:

The docstring description "Get the number of registered models" is misleading. Based on the return type and values, this function returns a dictionary with counts for different types ("mixed", "prefill", "decode"), not a single number. Consider updating to: "Get the registered model counts by type from the router."

Suggested change:

```diff
-def get_registered_number(router_url) -> list:
+def get_registered_number(router_url) -> dict:
     """
-    Get the number of registered models in the router.
+    Get the registered model counts by type from the router.
     Args:
         router_url (str): The base URL of the router, e.g. "http://localhost:8080".
     Returns:
-        int: The number of registered models.
+        dict: A dictionary with counts for each model type, e.g. {"mixed": int, "prefill": int, "decode": int}.
```

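A hedged sketch of what the corrected signature and docstring could look like, using only the standard library. The `/registered` endpoint path is an illustrative guess, not taken from the PR; the real route lives in the test utilities:

```python
import json
import urllib.request


def get_registered_number(router_url: str) -> dict:
    """Get the registered model counts by type from the router.

    Args:
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".

    Returns:
        dict: Counts per model type, e.g. {"mixed": 0, "prefill": 0, "decode": 0}.
    """
    try:
        # NOTE: the endpoint path here is illustrative, not quoted from the diff.
        with urllib.request.urlopen(f"{router_url}/registered", timeout=2) as resp:
            return json.loads(resp.read())
    except Exception:
        # Router unreachable: report zero registrations of every type.
        return {"mixed": 0, "prefill": 0, "decode": 0}


print(get_registered_number("http://localhost:1"))
```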
Comment on lines +98 to +99
```python
inference_start_time: float = 0,
llm_engine_recv_req_timestamp: float = 0,
```
Copilot AI commented Nov 7, 2025:

Unused parameter in __init__. The parameters inference_start_time (line 98) and llm_engine_recv_req_timestamp (line 99) are added to the function signature but inference_start_time is never used in the method. Only llm_engine_recv_req_timestamp is overwritten with time.time() on line 166, ignoring the passed value. Either remove the unused parameter or properly utilize the passed value.

Suggested change:

```diff
-inference_start_time: float = 0,
-llm_engine_recv_req_timestamp: float = 0,
```

Comment on lines +623 to 626
```python
if status:
    logger.info(f"connect to {target_ip}:{target_id} success")
else:
    logger.error(f"connect to {target_ip}:{target_id} failed")
```
Copilot AI commented Nov 7, 2025:

Inconsistent error message logic. Line 623 inverts the status check logic - it logs an error when status is True (successful connection) instead of when it's False (failed connection). The condition should be if not status: to correctly log errors only on connection failures.

Comment on lines +485 to +490
```python
if self.splitwise_role != "mixed":
    if self.scheduler_name == "local" and self.router is None:
        raise ValueError(
            f"When using {self.splitwise_role} role and the {self.scheduler_name} "
            f"scheduler, please provide --router argument."
        )
```
Copilot AI commented Nov 7, 2025:

Missing validation for splitwise configuration. Lines 485-490 add a validation that requires --router when using splitwise_role != "mixed" with scheduler_name == "local". However, there's no corresponding validation for the case when using splitwise_role != "mixed" with scheduler_name in ("splitwise", "dp"). Consider adding validation to ensure proper scheduler configuration for all splitwise deployment scenarios.

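One way the gap could be closed is a single validation helper covering every scheduler. The "local" rule below mirrors the quoted diff; the rule for the other schedulers is an assumption about what a fuller validation might look like, not code from the PR:

```python
def validate_splitwise_args(splitwise_role, scheduler_name, router):
    """Illustrative validation for splitwise deployment arguments."""
    if splitwise_role == "mixed":
        return  # mixed deployment needs no splitwise-specific checks
    if scheduler_name == "local":
        # Rule quoted from the diff: local scheduler requires a router.
        if router is None:
            raise ValueError(
                f"When using {splitwise_role} role and the {scheduler_name} "
                "scheduler, please provide --router argument."
            )
    elif scheduler_name not in ("splitwise", "dp"):
        # Assumed rule: reject schedulers outside the supported set.
        raise ValueError(
            f"Scheduler {scheduler_name!r} is not supported for {splitwise_role} deployment."
        )


validate_splitwise_args("prefill", "splitwise", None)  # accepted
try:
    validate_splitwise_args("prefill", "local", None)
except ValueError as e:
    print(e)
```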
Comment on lines +494 to +496
```python
raise ValueError(
    "Please set --rdma_comm_ports argument when using " "rdma cache transfer protocol."
)
```
Copilot AI commented Nov 7, 2025:

Inconsistent error message formatting. Lines 495 and 504 split the error message string across two lines with a space and empty quotes (e.g., "Please set --rdma_comm_ports argument when using " "rdma cache transfer protocol."). This should be formatted consistently, either as a single line or using proper line continuation. For better readability, consider: "Please set --rdma_comm_ports argument when using rdma cache transfer protocol."

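The hazard behind this nit is Python's implicit adjacent-string-literal concatenation: consecutive literals are joined with nothing inserted between them, so a missing trailing space silently fuses words. A quick illustration:

```python
# Adjacent string literals are concatenated at compile time, with no separator added.
msg = "Please set --rdma_comm_ports argument when using " "rdma cache transfer protocol."
print(msg)  # the trailing space inside the first literal keeps the words apart

# Drop that space and the words fuse, which is the readability hazard:
fused = "using" "rdma"
print(fused)  # → usingrdma
```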
Comment on lines +500 to +510
```python
if envs.ENABLE_V1_KVCACHE_SCHEDULER == 1:
    if "ipc" in self.cache_transfer_protocol:
        # FIXME: support ipc cache transfer protocol
        raise NotImplementedError(
            "only support rdma cache transfer protocol " "when using ENABLE_V1_KVCACHE_SCHEDULER."
        )
    # FIXME: fix this bug
    if self.splitwise_role == "prefill" and self.num_gpu_blocks_override is None:
        raise NotImplementedError(
            "please set num_gpu_blocks_override for prefill " "instance using ENABLE_V1_KVCACHE_SCHEDULER."
        )
```
Copilot AI commented Nov 7, 2025:

The error messages use the label "FIXME" in comments (lines 502, 506), suggesting these are temporary workarounds or known issues that need to be addressed. Consider creating tracking issues for these limitations and removing the NotImplementedError exceptions once proper implementations are in place, or document why these restrictions exist.

```python
    return False


def get_registered_number(router_url) -> list:
```
Copilot AI commented Nov 7, 2025:

The return type annotation is inconsistent with the actual return type. The docstring and annotation claim this function returns list, but the actual return value is a dictionary {"mixed": 0, "prefill": 0, "decode": 0} in the exception case and response.json() (likely a dict) in the success case. The return type should be dict instead of list.

```python
        router_url (str): The base URL of the router, e.g. "http://localhost:8080".
    Returns:
        int: The number of registered models.
```
Copilot AI commented Nov 7, 2025:

The Returns section in the docstring is inconsistent with the actual return type. It states int: The number of registered models, but the function returns a dictionary with multiple counts. Update to: dict: A dictionary containing registered model counts with keys "mixed", "prefill", and "decode".

Suggested change:

```diff
-        int: The number of registered models.
+        dict: A dictionary containing registered model counts with keys "mixed", "prefill", and "decode".
```
