50 changes: 6 additions & 44 deletions docs/features/disaggregated.md
@@ -29,48 +29,6 @@ In multi-instance scenarios, each incoming request needs to be assigned to diffe

## Usage Instructions

### Single-machine Disaggregated Deployment

#### Online Inference Service
Use the following commands for service deployment:

**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```

**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports should specify the engine-worker-queue-port of the Prefill service
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```

Note: When requesting the single-machine PD disaggregated service, **users should request the Decode service's port**.
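
For example, with the deployment above, a chat request goes to port 8184 (the Decode instance's `--port`); the snippet below is a minimal sketch, assuming the service runs on the local host:

```bash
# Send a chat request to the Decode instance's port; the Decode service
# coordinates with the Prefill instance (via --innode-prefill-ports) internally.
curl -X POST "http://127.0.0.1:8184/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "max_tokens": 20,
    "stream": true
  }'
```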

#### Offline Inference Service
Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/demo` directory for offline inference service deployment.
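
For a quick start, the demo can be launched directly (a sketch assuming it is run from the FastDeploy repository root, where `fastdeploy/demo` lives):

```bash
# Run the bundled offline PD-disaggregated demo.
python fastdeploy/demo/offline_disaggregated_demo.py
```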

### Multi-machine Disaggregated Deployment

#### Prerequisite: Redis
@@ -118,12 +76,14 @@ For multi-machine deployment, confirm that the NIC supports RDMA and that all no
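
A quick way to confirm that RDMA-capable NICs are visible on a node is sketched below (assuming the `libibverbs` utilities are installed; device names such as `mlx5_*` are only illustrative):

```bash
# List RDMA devices and their port state; usable links report PORT_ACTIVE.
ibv_devinfo | grep -E "hca_id|state"
```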
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
@@ -143,12 +103,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--port 8184 \
--metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
45 changes: 2 additions & 43 deletions docs/zh/features/disaggregated.md
@@ -29,49 +29,6 @@

## Usage Instructions

### Single-machine Disaggregated Deployment

#### Online Inference Service
Use the following commands for service deployment:

**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```

**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports should specify the engine-worker-queue-port of the Prefill service
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```

Note: When requesting the single-machine PD disaggregated service, **users should request the Decode service's port**.

#### Offline Inference Service

Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/demo` directory for offline inference service deployment.

### Multi-machine Disaggregated Deployment

#### Prerequisite: Redis
@@ -120,6 +77,7 @@ sudo systemctl start redis

export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
@@ -146,6 +104,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
19 changes: 17 additions & 2 deletions examples/splitwise/start_mixed.sh
@@ -1,6 +1,8 @@
#!/bin/bash
set -e

# Test mixed server + router

wait_for_health() {
local server_port=$1
while true; do
@@ -16,7 +18,6 @@ wait_for_health() {

# prepare environment
MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
# MODEL_NAME="baidu/ERNIE-4.5-21B-A3B-Paddle"

export FD_DEBUG=1
export ENABLE_V1_KVCACHE_SCHEDULER=0
@@ -51,7 +52,7 @@ nohup python -m fastdeploy.entrypoints.openai.api_server \
2>&1 >${FD_LOG_DIR}/nohup &
sleep 1

wait_for_health 8100
# wait_for_health 8100

# start modelserver 1
export CUDA_VISIBLE_DEVICES=1
@@ -69,3 +70,17 @@ nohup python -m fastdeploy.entrypoints.openai.api_server \
2>&1 >${FD_LOG_DIR}/nohup &

wait_for_health 8200


# send request
sleep 10 # make sure the servers have registered with the router
port=9000
curl -X POST "http://0.0.0.0:${port}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"max_tokens": 20,
"stream": true
}'
74 changes: 59 additions & 15 deletions examples/splitwise/start_v0_tp1.sh
@@ -2,9 +2,9 @@
set -e

# Test splitwise deployment
# v0 requires prefill and decode in one node and it uses local scheduler
# v1 supports prefill and decode in multi node and it uses splitwise scheduler
# v2 supports prefill and decode in multi node and it uses router and local scheduler
# There are two methods for splitwise deployment:
# v0: using splitwise_scheduler or dp_scheduler
# v1: using local_scheduler + router
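# In this script (v0), both instances are launched with --scheduler-name "splitwise"
# and point at the Redis instance started below via --scheduler-host/--scheduler-port.
# For a router-based setup (v1), see start_mixed.sh in this directory.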

wait_for_health() {
local server_port=$1
@@ -19,40 +19,64 @@ wait_for_health() {
done
}

# prepare environment
MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
# MODEL_NAME="baidu/ERNIE-4.5-21B-A3B-Paddle"
aistudio download --model ${MODEL_NAME}

export FD_DEBUG=1
export ENABLE_V1_KVCACHE_SCHEDULER=1
export KVCACHE_GDRCOPY_FLUSH_ENABLE=1

SCRIPT_PATH=$(readlink -f "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")
export $(bash ${SCRIPT_DIR}/../../scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS:${KVCACHE_RDMA_NICS}"
if [ -z "${KVCACHE_RDMA_NICS}" ]; then
echo "KVCACHE_RDMA_NICS is empty, please check the output of get_rdma_nics.sh"
exit 1
fi

unset http_proxy && unset https_proxy
rm -rf log_*

# start redis
if ! redis-cli ping &>/dev/null; then
echo "Redis is not running. Starting redis-server..."
redis-server --daemonize yes
sleep 1
else
echo "Redis is already running."
fi
sleep 1

# start prefill
export CUDA_VISIBLE_DEVICES=0
export FD_LOG_DIR="log_prefill"
mkdir -p ${FD_LOG_DIR}

export CUDA_VISIBLE_DEVICES=0
export FD_DEBUG=1
export ENABLE_V1_KVCACHE_SCHEDULER=0

nohup python -m fastdeploy.entrypoints.openai.api_server \
--model ${MODEL_NAME} \
--port 8100 \
--metrics-port 8101 \
--engine-worker-queue-port 8102 \
--cache-queue-port 8103 \
--max-model-len 32768 \
--num-gpu-blocks-override 1000 \
--splitwise-role "prefill" \
--cache-transfer-protocol "rdma" \
--rdma-comm-ports 8104 \
--pd-comm-port 8105 \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
2>&1 >${FD_LOG_DIR}/nohup &
wait_for_health 8100
# wait_for_health 8100

# start decode
export CUDA_VISIBLE_DEVICES=1
export FD_LOG_DIR="log_decode"
mkdir -p ${FD_LOG_DIR}

export CUDA_VISIBLE_DEVICES=1
export FD_DEBUG=1
export ENABLE_V1_KVCACHE_SCHEDULER=0

nohup python -m fastdeploy.entrypoints.openai.api_server \
--model ${MODEL_NAME} \
--port 9000 \
@@ -61,6 +85,26 @@ nohup python -m fastdeploy.entrypoints.openai.api_server \
--cache-queue-port 9003 \
--max-model-len 32768 \
--splitwise-role "decode" \
--innode-prefill-ports 8102 \
--cache-transfer-protocol "rdma" \
--rdma-comm-ports 9004 \
--pd-comm-port 9005 \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
2>&1 >${FD_LOG_DIR}/nohup &
wait_for_health 9000


# send request
sleep 10 # make sure the servers have registered with the scheduler
port=9000
curl -X POST "http://0.0.0.0:${port}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"max_tokens": 20,
"stream": true
}'