
Conversation

@nic-6443 nic-6443 commented Oct 19, 2025

Description

Run Jaeger locally via https://www.jaegertracing.io/docs/2.11/getting-started/#all-in-one

[image: screenshot-2025-10-24_12-32-41]

Which issue(s) this PR fixes:

Fixes #

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

Signed-off-by: Nic <qianyong@api7.ai>
@Revolyssup Revolyssup marked this pull request as ready for review October 23, 2025 14:21
@Revolyssup Revolyssup requested review from membphis, moonming and nic-chen and removed request for membphis October 25, 2025 20:19
Revolyssup previously approved these changes Oct 25, 2025

Revolyssup commented Oct 27, 2025

Benchmark

  1. Add route
 apisix git:(nic/opentelemetry) ✗ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "id": "otel-tracing-route",
    "uri": "/headers",
    "plugins": {
      "opentelemetry": {
        "sampler": {
          "name": "always_on"
        }
      }
    },
    "upstream": {
      "type": "roundrobin",
      "nodes": {
        "127.0.0.1:8080": 1
      }
    }
  }'

Without instrumentation

  • Run wrk
 wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.68ms    8.44ms 187.90ms   93.17%
    Req/Sec     3.78k     0.87k    6.71k    71.82%
  451166 requests in 30.07s, 148.44MB read
Requests/sec:  15003.94
Transfer/sec:      4.94MB

With instrumentation

  • Run wrk
Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.92ms   28.13ms 578.74ms   90.34%
    Req/Sec     1.01k   306.32     2.11k    69.74%
  120341 requests in 30.08s, 39.59MB read
Requests/sec:   4001.25
Transfer/sec:      1.32MB
  • Check traces
[image: screenshot-2025-10-27_12-29-30]

📊 Summary:

Enabling instrumentation caused roughly a 3.8× increase in latency and a 73% drop in throughput, indicating significant overhead from telemetry collection and reporting.

opentelemetry plugin disabled

 apisix git:(nic/opentelemetry) ✗ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "id": "otel-tracing-route",
    "uri": "/headers",
    "upstream": {
      "type": "roundrobin",
      "nodes": {
        "127.0.0.1:8080": 1
      }
    }
  }'

Without instrumentation

apisix git:(master)  wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.97ms    5.43ms 198.60ms   96.88%
    Req/Sec    10.67k     1.60k   13.13k    80.67%
  1276020 requests in 30.06s, 335.86MB read
Requests/sec:  42443.85
Transfer/sec:     11.17MB

With instrumentation

 apisix git:(nic/opentelemetry)  wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.07ms    4.73ms 135.56ms   96.04%
    Req/Sec    10.02k     1.65k   13.20k    79.83%
  1198618 requests in 30.08s, 315.49MB read
Requests/sec:  39843.43
Transfer/sec:     10.49MB

📊 Summary:

With the opentelemetry plugin removed from the route, the instrumented branch shows only a small overhead versus master: throughput drops by ~6% and average latency rises by ~3%. Overall, the impact of the instrumentation code itself is minimal.

Interpretation:

The instrumentation code itself adds negligible overhead while inactive. The major slowdown observed earlier (a ~73% drop in throughput) occurs only when the opentelemetry plugin is actually enabled and exporting traces, not merely because the instrumentation code exists. Most of that overhead comes from the inject_core_spans function.
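If profiling does confirm that allocation in inject_core_spans dominates, one common OpenResty remedy is recycling span tables through the lua-tablepool module. A minimal sketch under that assumption (the pool name, span fields, and the new_span/release_span helpers are hypothetical, not the PR's actual API):

```lua
local tablepool = require("tablepool")  -- OpenResty's lua-tablepool

local POOL_NAME = "otel_spans"  -- hypothetical pool name

-- Fetch a recycled table instead of allocating a fresh one per span.
-- tablepool.fetch(name, narr, nrec) returns a table with roughly that many
-- preallocated array/hash slots.
local function new_span(name, kind, start_time)
    local span = tablepool.fetch(POOL_NAME, 0, 8)
    span.name = name
    span.kind = kind
    span.start_time = start_time
    return span
end

-- Return the table to the pool once the span has been exported; tablepool
-- clears it, so the next fetch hands back an empty table.
local function release_span(span)
    tablepool.release(POOL_NAME, span)
end
```

This trades a little bookkeeping for avoiding per-request garbage, which is usually the right trade in a hot path like span creation.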

self.end_time = 0
self.kind = kind
self.attributes = self.attributes or {}
self.children = self.children or {}
Review comment (Member):

The attributes and children tables can be reused.
We can generate a flame graph to confirm whether this optimization is needed.
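The reuse pattern the comment points at: keep the nested attributes and children tables alive across recycles and only wipe their contents, so no new tables are allocated per request. A sketch assuming LuaJIT's table.clear extension (reset_span is an illustrative name, not the PR's code):

```lua
local clear_tab = require("table.clear")  -- LuaJIT extension

-- Hypothetical reset for a recycled span object: `or {}` allocates the
-- nested tables only once, and clear_tab empties them in place on reuse.
local function reset_span(self, kind)
    self.end_time = 0
    self.kind = kind
    self.attributes = self.attributes or {}
    self.children = self.children or {}
    clear_tab(self.attributes)
    clear_tab(self.children)
    return self
end
```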


function _M.set_status(self, code, message)
code = span_status.validate(code)
local status = {
Review comment (Member):

    local status = self.status
    if not status then
        status = {
            code = code,
            message = ""
        }
        self.status = status
    else
        status.code = code
    end

    if code == span_status.ERROR then
        status.message = message
    end



function _M.set_attributes(self, ...)
for _, attr in ipairs({ ... }) do
Review comment (Member):

The current way is slow. Try the new way instead: iterate the varargs with select('#', ...) rather than packing them into a temporary table:

for i = 1, select('#', ...) do ... end
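The point of the suggestion: ipairs({ ... }) packs the varargs into a throwaway table on every call, while select reads them in place. A side-by-side sketch in plain Lua (set_attributes_pack and set_attributes_select are illustrative names):

```lua
-- Current style: { ... } allocates a temporary table per call.
local function set_attributes_pack(self, ...)
    for _, attr in ipairs({ ... }) do
        self.attributes[#self.attributes + 1] = attr
    end
end

-- Suggested style: select('#', ...) yields the argument count and
-- select(i, ...) the arguments from i onward, so nothing is allocated.
local function set_attributes_select(self, ...)
    for i = 1, select("#", ...) do
        self.attributes[#self.attributes + 1] = (select(i, ...))
    end
end
```

The parentheses around select(i, ...) matter: they truncate its multiple return values to just the i-th argument.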

self.end_time = util.time_nano()
end

function _M.release(self)
Review comment (Member):

Same as the comment above: the table pool will call table.clear automatically on release.

Suggested change:

function _M.release(self)
    tablepool.release(pool_name, self)
end



function _M.clear(self)
for i = 1, self._n do
Review comment (Member):

we can call table.clear(self._data), which is much easier
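For reference, table.clear (a LuaJIT extension loaded via require) empties a table in place while keeping its allocated capacity, which is both simpler and cheaper than nil-ing slots in a loop. A sketch with a hypothetical buffer-like object:

```lua
local clear_tab = require("table.clear")  -- LuaJIT extension

-- Equivalent to `for i = 1, self._n do self._data[i] = nil end`, but a
-- single call, and it also drops any hash-part keys that loop would miss.
local function clear(self)
    clear_tab(self._data)
    self._n = 0
end
```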

end

function _M.finish_all_spans(code, message)
if not ngx.ctx._apisix_spans then
Review comment (Member):

local apisix_spans = ngx.ctx._apisix_spans
if not apisix_spans then
    return
end

for _, sp in pairs(apisix_spans) do
    if code then
        sp:set_status(code, message)
    end
    sp:finish()
end

@Revolyssup Revolyssup requested a review from membphis October 28, 2025 07:55
return false, "failed to create radixtree router: " .. err
end
radixtree_router_ver = ssl_certificates.conf_version
tracer.finish_current_span()
Review comment (Contributor):

Frankly, this API is confusing. The call to end a span should be something like span:finish(). Using tracer.finish_current_span() implies that spans are managed by a context record inside the tracer. That raises a question for me: could it lead to misuse or conflicts when handling multiple requests in parallel, or when a single request contains yield operations? Especially since we might later optimize it to use some kind of table pooling mechanism.

This is a concern. While I initially suspect it is not an issue in single-threaded Nginx, we should avoid this confusion in the API design. Otherwise it requires programmers to be thoroughly familiar with the OpenResty programming model, understanding the extent to which data is shared and how conflicts might arise, which imposes an additional explanation burden on us.

This is from a DX perspective. Technically, we may need to rethink how the stack is used to properly connect all spans.
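One way to address the concern above is to anchor the span stack in ngx.ctx, which OpenResty scopes to a single request, and make finishing a method on the span itself rather than a tracer-level call. A sketch under those assumptions (field and function names are hypothetical, not the PR's final design):

```lua
-- ngx.ctx is per-request in OpenResty, so a stack stored there cannot be
-- observed or corrupted by concurrent requests.
local function span_stack()
    local stack = ngx.ctx._span_stack
    if not stack then
        stack = {}
        ngx.ctx._span_stack = stack
    end
    return stack
end

local function push_span(span)
    local stack = span_stack()
    stack[#stack + 1] = span
    span._stack = stack
end

-- span:finish() instead of tracer.finish_current_span(): the span carries a
-- reference to its own stack, so there is no hidden tracer-global state.
local function finish(span, end_time)
    span.end_time = end_time
    local stack = span._stack
    if stack and stack[#stack] == span then
        stack[#stack] = nil
    end
end
```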

Review comment (Member):

I had the same concern when I reviewed this code the first time.

APISIX will encounter an error if there are concurrent requests.

[image]

New way (should be the same as what @bzp2010 suggested):

[image]


Labels

enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files.
