Conversation

@changminbark
Contributor

PR Template

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind feature

What this PR does / why we need it:
This PR introduces a way to produce constant load at a fixed concurrency level per stage. This is needed to understand how the system performs under constant load. It is achieved by capping the max concurrency of the workers in every stage at the desired concurrency level.
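A minimal sketch of the capping idea, not the actual implementation: an asyncio semaphore sized to the stage's concurrency level bounds how many requests are ever in flight at once (`run_stage` and `send_request` are hypothetical names).

```python
import asyncio

async def run_stage(concurrency_level: int, num_requests: int) -> int:
    """Issue num_requests requests, never exceeding concurrency_level in flight."""
    sem = asyncio.Semaphore(concurrency_level)
    in_flight = 0
    peak = 0

    async def send_request(i: int) -> None:
        nonlocal in_flight, peak
        async with sem:  # blocks once concurrency_level workers are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for the actual HTTP request
            in_flight -= 1

    await asyncio.gather(*(send_request(i) for i in range(num_requests)))
    return peak

peak = asyncio.run(run_stage(concurrency_level=6, num_requests=50))
```

Here `peak` can never exceed the configured concurrency level, which is the property that yields constant load per stage.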

Which issue(s) this PR fixes:

Fixes #252

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

The load generator can now generate constant load at a specific concurrency level in each stage (the workers' max concurrency is capped to hold each stage at the desired level). Graphs of the metrics against the concurrency level are also generated.
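The graphs named below could be produced along these lines; this is a hedged matplotlib sketch with made-up sample numbers, not the tool's actual plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Illustrative sample points only (concurrency level -> latency / throughput).
concurrency = [2, 6]
latency_s = [0.12, 0.31]
throughput_rps = [16.0, 19.4]

fig, ax = plt.subplots()
ax.plot(concurrency, latency_s, marker="o")
ax.set_xlabel("concurrency level")
ax.set_ylabel("latency (s)")
ax.set_title("latency_vs_concurrency")
fig.savefig("latency_vs_concurrency.png")

fig2, ax2 = plt.subplots()
ax2.plot(concurrency, throughput_rps, marker="o")
ax2.set_xlabel("concurrency level")
ax2.set_ylabel("throughput (req/s)")
ax2.set_title("throughput_vs_concurrency")
fig2.savefig("throughput_vs_concurrency.png")
```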

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Testing

Testing was done using the config.yml file shown below together with the necessary services (vLLM serving HuggingFaceTB/SmolLM2-135M-Instruct and a local Prometheus instance).

Functional test output:

config.yaml

stage_0_lifecycle_metrics.json
stage_1_lifecycle_metrics.json
summary_lifecycle_metrics.json
summary_prometheus_metrics.json

Graphs: latency_vs_concurrency, throughput_vs_concurrency, throughput_vs_latency

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 30, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 30, 2025
@changminbark
Contributor Author

/assign @achandrasekar


@jjk-g jjk-g left a comment


Thank you for adding this!

@changminbark
Contributor Author

Latest Test:

Validation test for loadgen config:

Misconfigured YAML

load:
  type: constant
  stages:
  - rate: 50.0
    duration: 1
    num_requests: 50
    concurrency_level: 6
  - rate: 25.0
    duration: 1
    num_requests: 25
    concurrency_level: 2
api: 
  type: completion
  streaming: true
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-30 14:48:15,299 - inference_perf.config - INFO - Using configuration from: config.yml
Traceback (most recent call last):
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 332, in <module>
    main_cli()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 118, in main_cli
    config = read_config(args.config_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/config.py", line 298, in read_config
    converted_stages.append(StandardLoadStage(**stage))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for StandardLoadStage
num_requests
  Input should be None [type=none_required, input_value=50, input_type=int]
    For further information visit https://errors.pydantic.dev/2.11/v/none_required
concurrency_level
  Input should be None [type=none_required, input_value=6, input_type=int]
    For further information visit https://errors.pydantic.dev/2.11/v/none_required
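The `none_required` errors above arise because, under pydantic v2, a field annotated as `None` rejects any non-`None` input. A minimal sketch reproducing the behavior (`StageSketch` is a hypothetical stand-in, not the project's actual `StandardLoadStage` model):

```python
from pydantic import BaseModel, ValidationError

# Hypothetical stand-in: under a constant-load config these fields are
# expected to be absent, so passing integer values triggers none_required.
class StageSketch(BaseModel):
    rate: float
    duration: int
    num_requests: None = None       # annotated as None -> any int is rejected
    concurrency_level: None = None

try:
    StageSketch(rate=50.0, duration=1, num_requests=50, concurrency_level=6)
except ValidationError as e:
    errors = {err["loc"][0]: err["type"] for err in e.errors()}
    print(errors)
```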

Functional test (running inference)

config.yaml

stage_0_lifecycle_metrics.json
stage_1_lifecycle_metrics.json
summary_lifecycle_metrics.json
summary_prometheus_metrics.json

Graphs: latency_vs_concurrency, throughput_vs_concurrency, throughput_vs_latency

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: changminbark
Once this PR has been reviewed and has the lgtm label, please ask for approval from achandrasekar. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues:

Option to Generate Load Based on Concurrency
