[TSP] Support qwen3 moe tsp + cudagraph #4871
Conversation
Thanks for your contribution!
Pull Request Overview
This PR refactors the RMSNorm layer to consistently return a tuple of (output, residual_out) in all cases, and updates all model files to handle this new API. Additionally, it includes bug fixes for environment variable handling and quantization logic.
- Standardizes RMSNorm to always return a tuple regardless of whether `residual_input` is provided (see the sketch after this list)
- Updates all model files to unpack the tuple return value using `[0]` indexing
- Fixes environment variable type conversion for stop sequence configuration
- Corrects quantization logic for handling `None` values in `output_dim`
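The following is a minimal sketch of the standardized return convention, assuming Paddle tensors; the function and argument names are illustrative rather than the exact FastDeploy implementation.

```python
import paddle


def rms_norm(x, weight, residual_input=None, eps=1e-6):
    """Sketch of the new contract: always return (out, residual_out)."""
    # Fold the residual in first when one is provided.
    residual_out = x if residual_input is None else x + residual_input
    # Standard RMSNorm: scale by the reciprocal root-mean-square over the last axis.
    variance = residual_out.astype("float32").pow(2).mean(axis=-1, keepdim=True)
    out = residual_out.astype("float32") * paddle.rsqrt(variance + eps)
    return out.astype(x.dtype) * weight, residual_out


# Call sites change accordingly (illustrative):
#   hidden_states = self.norm(hidden_states)                      # old: bare tensor
#   hidden_states = self.norm(hidden_states)[0]                   # new: take element 0
#   hidden_states, residual = self.norm(hidden_states, residual)  # fused residual path
```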
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/normalization.py | Refactors RMSNorm.forward() to always return tuple (output, residual_out) |
| fastdeploy/model_executor/models/qwen3moe.py | Updates RMSNorm calls to use new API with tuple unpacking |
| fastdeploy/model_executor/models/qwen3.py | Updates RMSNorm calls to extract first element from returned tuple |
| fastdeploy/model_executor/models/qwen2.py | Updates RMSNorm calls and input_layernorm parameter names |
| fastdeploy/model_executor/models/qwen2_5_vl/qwen2_5_vl.py | Updates norm call to use tuple unpacking |
| fastdeploy/model_executor/models/paddleocr_vl/paddleocr_vl.py | Updates norm call to use tuple unpacking |
| fastdeploy/model_executor/models/gpt_oss.py | Updates RMSNorm calls to use new API |
| fastdeploy/model_executor/models/glm4_moe.py | Updates q_norm and k_norm calls with tuple unpacking |
| fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py | Updates RMSNorm calls to use new API |
| fastdeploy/model_executor/models/ernie4_5_mtp.py | Updates multiple norm calls to use tuple unpacking |
| fastdeploy/model_executor/models/ernie4_5_moe.py | Updates RMSNorm calls and removes forward_meta from post_attention_layernorm |
| fastdeploy/model_executor/models/deepseek_v3.py | Updates layernorm and RMSNorm calls to use new API |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Adds None check before negating output_dim boolean (see the sketch after this table) |
| fastdeploy/model_executor/layers/linear.py | Fixes reshape dimension and output_dim assignment |
| fastdeploy/input/text_processor.py | Removes unused return_tensors parameter |
| fastdeploy/envs.py | Converts environment variables to int type at source |
| fastdeploy/entrypoints/engine_client.py | Removes redundant int() conversions |
| fastdeploy/engine/engine.py | Removes redundant int() conversions |
| fastdeploy/config.py | Removes redundant int() conversions |
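For the block_wise_fp8.py row, a minimal sketch of the None-aware guard; the dictionary and key names here are hypothetical, not the actual FastDeploy code.

```python
# Hypothetical weight attributes; "output_dim" may legitimately be absent/None.
extra_weight_attrs = {"weight_loader": None}

output_dim = extra_weight_attrs.get("output_dim")

# In Python, `not None` evaluates to True, so negating an unset output_dim would
# silently turn "unknown" into a concrete boolean. Guard before negating:
if output_dim is not None:
    output_dim = not output_dim
```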
Comments suppressed due to low confidence (2)
fastdeploy/model_executor/models/qwen3moe.py:206
- The `post_attention_layernorm` call is missing the `forward_meta` parameter. Looking at lines 196-198, where `input_layernorm` is called with `forward_meta`, the `post_attention_layernorm` call should also include this parameter for consistency and proper parallel execution handling.
hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
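A hedged sketch of the change this comment suggests, assuming `forward_meta` can simply be threaded through the same way the referenced `input_layernorm` call receives it (the exact signature is not shown in this excerpt):

```python
# Suggested by the suppressed comment (sketch only; signature assumed):
hidden_states, residual = self.post_attention_layernorm(hidden_states, residual, forward_meta)
```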
fastdeploy/model_executor/layers/normalization.py:1
- The docstring states 'If `residual_input` is None, returns the normalized output tensor' and 'If `residual_input` is provided, returns a tuple', but the implementation now always returns a tuple `(out, residual_out)` regardless of whether `residual_input` is None. The documentation should be updated to reflect that this method always returns a tuple.
"""
zhupengyang
left a comment
LGTM
carryyu
left a comment
LGTM
qingqing01
left a comment
LGTM
gongshaotian
left a comment
LGTM
gongshaotian
left a comment
LGTM
Motivation
#4688 and #4836 added TSP-EP hybrid parallelism for ernie4.5 moe on XPU, but only the ernie4.5 moe model was verified and the implementation was not general enough. This PR adds and verifies support for Qwen3 moe in GPU + CUDAGraph mode, and the approach applies to other MoE models as well.
Modifications
Usage or Command
DP2TP4EP8 (i.e., data parallelism 2 × tensor parallelism 4, with expert parallelism 8)
Accuracy Tests
None.
Checklist
- Add at least one tag from the following list to the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.