
Conversation

@nic-6443 nic-6443 commented Oct 19, 2025

Description

Run Jaeger locally via https://www.jaegertracing.io/docs/2.11/getting-started/#all-in-one

[image: screenshot-2025-10-24_12-32-41]

Which issue(s) this PR fixes:

Fixes #

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

Signed-off-by: Nic <qianyong@api7.ai>
@Revolyssup Revolyssup marked this pull request as ready for review October 23, 2025 14:21
@Revolyssup Revolyssup requested review from membphis, moonming and nic-chen and removed request for membphis October 25, 2025 20:19
Revolyssup previously approved these changes Oct 25, 2025

Revolyssup commented Oct 27, 2025

Benchmark

  1. Add route
 apisix git:(nic/opentelemetry) ✗ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "id": "otel-tracing-route",
    "uri": "/headers",
    "plugins": {
      "opentelemetry": {
        "sampler": {
          "name": "always_on"
        }
      }
    },
    "upstream": {
      "type": "roundrobin",
      "nodes": {
        "127.0.0.1:8080": 1
      }
    }
  }'

Without instrumentation

  • Run wrk
 wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.68ms    8.44ms 187.90ms   93.17%
    Req/Sec     3.78k     0.87k    6.71k    71.82%
  451166 requests in 30.07s, 148.44MB read
Requests/sec:  15003.94
Transfer/sec:      4.94MB

With instrumentation

  • Run wrk
Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.92ms   28.13ms 578.74ms   90.34%
    Req/Sec     1.01k   306.32     2.11k    69.74%
  120341 requests in 30.08s, 39.59MB read
Requests/sec:   4001.25
Transfer/sec:      1.32MB
  • Check traces
[image: screenshot-2025-10-27_12-29-30]

📊 Summary:

Enabling instrumentation caused roughly a 3.8× increase in latency and a 73% drop in throughput, indicating significant overhead from telemetry collection and reporting.

opentelemetry plugin disabled

 apisix git:(nic/opentelemetry) ✗ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "id": "otel-tracing-route",
    "uri": "/headers",
    "upstream": {
      "type": "roundrobin",
      "nodes": {
        "127.0.0.1:8080": 1
      }
    }
  }'

Without instrumentation

apisix git:(master)  wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.97ms    5.43ms 198.60ms   96.88%
    Req/Sec    10.67k     1.60k   13.13k    80.67%
  1276020 requests in 30.06s, 335.86MB read
Requests/sec:  42443.85
Transfer/sec:     11.17MB

With instrumentation

 apisix git:(nic/opentelemetry)  wrk -t4 -c100 -d30s http://localhost:9080/headers

Running 30s test @ http://localhost:9080/headers
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.07ms    4.73ms 135.56ms   96.04%
    Req/Sec    10.02k     1.65k   13.20k    79.83%
  1198618 requests in 30.08s, 315.49MB read
Requests/sec:  39843.43
Transfer/sec:     10.49MB

📊 Summary:

With the opentelemetry plugin removed from the route, the instrumented branch shows only a small overhead versus master: throughput drops by ~6% and average latency rises by ~3%. Overall, the impact of the instrumentation code itself is minimal.

Interpretation:

The instrumentation code itself adds negligible overhead while inactive. The major slowdown observed earlier (a ~73% drop in throughput) occurs only when the opentelemetry plugin is actually enabled and exporting traces, not merely because the instrumentation code exists. Most of that overhead comes from the inject_core_spans function.
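If profiling does confirm that allocation in inject_core_spans dominates, one common OpenResty remedy is recycling span tables through the lua-tablepool module. A minimal sketch under that assumption (the pool name, span fields, and the new_span/release_span helpers are hypothetical, not the PR's actual API):

```lua
local tablepool = require("tablepool")  -- OpenResty's lua-tablepool

local POOL_NAME = "otel_spans"  -- hypothetical pool name

-- Fetch a recycled table instead of allocating a fresh one per span.
-- tablepool.fetch(name, narr, nrec) returns a table with roughly that many
-- preallocated array/hash slots.
local function new_span(name, kind, start_time)
    local span = tablepool.fetch(POOL_NAME, 0, 8)
    span.name = name
    span.kind = kind
    span.start_time = start_time
    return span
end

-- Return the table to the pool once the span has been exported; tablepool
-- clears it, so the next fetch hands back an empty table.
local function release_span(span)
    tablepool.release(POOL_NAME, span)
end
```

This trades a little bookkeeping for avoiding per-request garbage, which is usually the right trade in a hot path like span creation.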

self.end_time = 0
self.kind = kind
self.attributes = self.attributes or {}
self.children = self.children or {}
Review comment (Member):

The attributes and children tables can be reused.
We can generate a flame graph to confirm whether this optimization is needed.
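The reuse pattern the comment points at: keep the nested attributes and children tables alive across recycles and only wipe their contents, so no new tables are allocated per request. A sketch assuming LuaJIT's table.clear extension (reset_span is an illustrative name, not the PR's code):

```lua
local clear_tab = require("table.clear")  -- LuaJIT extension

-- Hypothetical reset for a recycled span object: `or {}` allocates the
-- nested tables only once, and clear_tab empties them in place on reuse.
local function reset_span(self, kind)
    self.end_time = 0
    self.kind = kind
    self.attributes = self.attributes or {}
    self.children = self.children or {}
    clear_tab(self.attributes)
    clear_tab(self.children)
    return self
end
```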


function _M.set_status(self, code, message)
code = span_status.validate(code)
local status = {
Review comment (Member):

    local status = self.status
    if not status then
        status = {
            code = code,
            message = ""
        }
        self.status = status
    else
        status.code = code
    end

    if code == span_status.ERROR then
        status.message = message
    end



function _M.set_attributes(self, ...)
for _, attr in ipairs({ ... }) do
Review comment (Member):

The current way is slow. Try the new way instead: iterate the varargs with select('#', ...) rather than packing them into a temporary table:

for i = 1, select('#', ...) do ... end
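The point of the suggestion: ipairs({ ... }) packs the varargs into a throwaway table on every call, while select reads them in place. A side-by-side sketch in plain Lua (set_attributes_pack and set_attributes_select are illustrative names):

```lua
-- Current style: { ... } allocates a temporary table per call.
local function set_attributes_pack(self, ...)
    for _, attr in ipairs({ ... }) do
        self.attributes[#self.attributes + 1] = attr
    end
end

-- Suggested style: select('#', ...) yields the argument count and
-- select(i, ...) the arguments from i onward, so nothing is allocated.
local function set_attributes_select(self, ...)
    for i = 1, select("#", ...) do
        self.attributes[#self.attributes + 1] = (select(i, ...))
    end
end
```

The parentheses around select(i, ...) matter: they truncate its multiple return values to just the i-th argument.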

self.end_time = util.time_nano()
end

function _M.release(self)
Review comment (Member):

Same as the comment above: the table pool will call table.clear automatically on release.

Suggested change:

function _M.release(self)
    tablepool.release(pool_name, self)
end



function _M.clear(self)
for i = 1, self._n do
Review comment (Member):

we can call table.clear(self._data), which is much easier
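For reference, table.clear (a LuaJIT extension loaded via require) empties a table in place while keeping its allocated capacity, which is both simpler and cheaper than nil-ing slots in a loop. A sketch with a hypothetical buffer-like object:

```lua
local clear_tab = require("table.clear")  -- LuaJIT extension

-- Equivalent to `for i = 1, self._n do self._data[i] = nil end`, but a
-- single call, and it also drops any hash-part keys that loop would miss.
local function clear(self)
    clear_tab(self._data)
    self._n = 0
end
```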

end

function _M.finish_all_spans(code, message)
if not ngx.ctx._apisix_spans then
Review comment (Member):

local apisix_spans = ngx.ctx._apisix_spans
if not apisix_spans then
    return
end

for _, sp in pairs(apisix_spans) do
    if code then
        sp:set_status(code, message)
    end
    sp:finish()
end

@Revolyssup Revolyssup requested a review from membphis October 28, 2025 07:55
return false, "failed to create radixtree router: " .. err
end
radixtree_router_ver = ssl_certificates.conf_version
tracer.finish_current_span()
Review comment (Contributor):

Frankly, this API is confusing. The call to end a span should be something like span:finish(). Using tracer.finish_current_span() implies that spans are managed by a context record inside the tracer. That raises a question for me: could it lead to misuse or conflicts when handling multiple requests in parallel, or when a single request contains yield operations? Especially since we might later optimize it to use some kind of table pooling mechanism.

This is a concern. While I initially suspect it is not an issue in single-threaded Nginx, we should avoid this confusion in the API design. Otherwise it requires programmers to be thoroughly familiar with the OpenResty programming model, understanding the extent to which data is shared and how conflicts might arise, which imposes an additional explanation burden on us.

This is from a DX perspective. Technically, we may need to rethink how the stack is used to properly connect all spans.
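One way to address the concern above is to anchor the span stack in ngx.ctx, which OpenResty scopes to a single request, and make finishing a method on the span itself rather than a tracer-level call. A sketch under those assumptions (field and function names are hypothetical, not the PR's final design):

```lua
-- ngx.ctx is per-request in OpenResty, so a stack stored there cannot be
-- observed or corrupted by concurrent requests.
local function span_stack()
    local stack = ngx.ctx._span_stack
    if not stack then
        stack = {}
        ngx.ctx._span_stack = stack
    end
    return stack
end

local function push_span(span)
    local stack = span_stack()
    stack[#stack + 1] = span
    span._stack = stack
end

-- span:finish() instead of tracer.finish_current_span(): the span carries a
-- reference to its own stack, so there is no hidden tracer-global state.
local function finish(span, end_time)
    span.end_time = end_time
    local stack = span._stack
    if stack and stack[#stack] == span then
        stack[#stack] = nil
    end
end
```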

Review comment (Member):

I had the same concern when I reviewed this code the first time.

APISIX will encounter an error if there are concurrent requests.

[image]

New way (should be the same as what @bzp2010 suggested):

[image]


Labels

enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files.
