Contributor

@sveniu sveniu commented Oct 29, 2025

What does this PR do?

Add support for collecting metrics from Couchbase's Prometheus endpoint as an alternative to the legacy REST API.

The Prometheus endpoint exposes significantly more comprehensive metrics than the legacy REST API: the total metric count increases from ~400 to ~1,900.

End users will get greatly improved insight into the performance of their Couchbase clusters.

Motivation

The legacy REST metrics are too limited for running production-critical workloads on Couchbase. To troubleshoot complex issues, operators often have to use the Couchbase admin UI to access metrics that provide useful diagnostic insight. Users with access to Couchbase Support are familiar with the pattern: troubleshooting an issue, collecting cluster logs, uploading to Couchbase Support, and having Support recreate various metrics that pinpoint the problem. Those diagnostic metrics are rarely available through the legacy REST endpoint.

By enabling the comprehensive Prometheus metrics in Datadog, operators gain cluster visibility comparable to that of Couchbase Support, significantly reducing troubleshooting time and effort.

The metric count increase provides Datadog users with a substantially more useful integration.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Add support for collecting metrics from Couchbase's Prometheus/OpenMetrics
endpoint as an alternative to the legacy REST API.

The Prometheus metrics are much more comprehensive than those from the REST
API, and will give users a greater insight into how their Couchbase clusters
are performing.

The `prometheus_url` configuration option allows users to specify the
Prometheus endpoint (default: http://localhost:8091/metrics for Couchbase 7.0+).
The `server` option remains for backward compatibility with the legacy REST API.
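
As a rough illustration of how the two options could coexist, here is a minimal sketch of selecting a collection path from an instance config. The helper name `select_metric_source` and the rule that `prometheus_url` takes precedence when both options are set are assumptions for illustration, not taken from the PR's code.

```python
# Default endpoint quoted in the commit message above (Couchbase 7.0+).
DEFAULT_PROMETHEUS_URL = "http://localhost:8091/metrics"


def select_metric_source(instance):
    """Hypothetical helper: return (source, url) for an instance config dict.

    Assumption: `prometheus_url` wins if both options are present.
    """
    if instance.get("prometheus_url"):
        return "prometheus", instance["prometheus_url"]
    if instance.get("server"):
        # Existing configurations keep using the legacy REST API.
        return "rest", instance["server"]
    raise ValueError("instance must set either `prometheus_url` or `server`")


# Example instances: one new-style, one legacy.
new_style = {"prometheus_url": DEFAULT_PROMETHEUS_URL}
legacy = {"server": "http://localhost:8091"}
print(select_metric_source(new_style))
print(select_metric_source(legacy))
```
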

This change is backward compatible - existing configurations using `server`
continue to work with the REST API without modification.

Use the legacy OpenMetricsBaseCheck, since it is more lenient with malformed
data such as the output of Couchbase's Prometheus endpoint.

Example: Many Couchbase metrics lack proper Prometheus TYPE declarations.
OpenMetricsBaseCheck handles this better by allowing explicit type overrides,
whereas OpenMetricsBaseCheckV2 is stricter about requiring proper TYPE
annotations.

Example: OpenMetricsBaseCheck provides better control over histogram output
format. We set send_histograms_buckets=True, which gives:

  - The traditional histogram format
  - Separate .sum, .count, and .bucket metrics (easier to work with in Datadog)

V2, by contrast, defaults to distribution metrics, which may not match existing
dashboards.

The legacy check's type_overrides parameter works seamlessly with our metric
transformation utilities, allowing us to fix Couchbase's incorrect/missing type
metadata systematically.
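
A minimal sketch of what such an instance configuration might look like, built as a plain dict so it runs without the Agent installed. The keys (`prometheus_url`, `namespace`, `metrics`, `type_overrides`, `send_histograms_buckets`) are the legacy OpenMetricsBaseCheck instance options; the helper name and the sample metric and its `gauge` type are assumptions for illustration only.

```python
def build_openmetrics_instance(prometheus_url, metric_map, type_overrides):
    """Hypothetical helper: assemble a legacy OpenMetricsBaseCheck instance dict."""
    return {
        "prometheus_url": prometheus_url,
        "namespace": "couchbase",
        # Raw Prometheus name -> Datadog name (without the namespace prefix).
        "metrics": [metric_map],
        # Force a type for metrics whose TYPE line is missing or wrong.
        "type_overrides": type_overrides,
        # Emit .bucket metrics alongside .sum and .count for histograms.
        "send_histograms_buckets": True,
    }


# Illustrative input; the type chosen for this metric is an assumption.
instance = build_openmetrics_instance(
    "http://localhost:8091/metrics",
    {"kv_dcp_backoff": "kv.dcp_backoff"},
    {"kv_dcp_backoff": "gauge"},
)
print(instance["type_overrides"])
```
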

The legacy OpenMetricsBaseCheck is also used in other integrations, such as
HAProxy and Cilium.

Provide functions to convert between Couchbase Prometheus metric names and
Datadog metric names, including metric mapping and type overrides.

The transformation handles:
- Name conversion: `kv_dcp_backoff` → `kv.dcp_backoff`
- Legacy/misnamed metrics via RAW_METRIC_NAME_MAP
- Type overrides for metrics with missing/incorrect Prometheus TYPE metadata
- Histogram metric handling (_bucket, _count, _sum suffixes)

These utilities are used by CouchbaseCheckV2 to configure the OpenMetrics
metric map and type overrides, ensuring proper metric collection and typing.
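
The name handling described above could look roughly like the sketch below. It is not the PR's implementation: the function names are hypothetical, the "first underscore becomes a dot" rule is inferred from the single `kv_dcp_backoff` example, and the `RAW_METRIC_NAME_MAP` entries are borrowed from the rename list later in this PR rather than from the actual map.

```python
HISTOGRAM_SUFFIXES = ("_bucket", "_count", "_sum")

# Assumed contents, taken from the renames listed in the metadata commit.
RAW_METRIC_NAME_MAP = {
    "audit_queue_length": "cm_audit_queue_length",
    "audit_unsuccessful_retries": "cm_audit_unsuccessful_retries",
    "total_knn_queries_rejected_by_throttler": "fts_num_knn_queries_rejected_by_throttler",
}


def to_datadog_name(raw_name):
    """Convert a raw Prometheus metric name to a dotted Datadog name."""
    name = RAW_METRIC_NAME_MAP.get(raw_name, raw_name)
    if name.startswith("couch_"):
        # couch_* -> kv_couch_*, per the rename list.
        name = "kv_" + name
    service, sep, rest = name.partition("_")
    # Only the first underscore (after the service prefix) becomes a dot.
    return f"{service}.{rest}" if sep else name


def split_histogram_suffix(raw_name):
    """Return (base_name, suffix); suffix is "" if no histogram suffix matches.

    Caveat: a name ending in _count is not necessarily a histogram series,
    so a real implementation would also need the metric's declared type.
    """
    for suffix in HISTOGRAM_SUFFIXES:
        if raw_name.endswith(suffix):
            return raw_name[: -len(suffix)], suffix
    return raw_name, ""


print(to_datadog_name("kv_dcp_backoff"))
```
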
Add comprehensive Prometheus metrics metadata curated from official Couchbase
sources. This establishes metadata.csv as the authoritative source for all
Couchbase metric definitions (1,841 total metrics).

The following was done to produce the new metadata:

Download Couchbase's metric metadata from these URLs:

  https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/cm_metrics_metadata.json
  https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/kv_metrics_metadata.json
  https://github.com/couchbase/ns_server/raw/refs/heads/master/etc/metrics_metadata.json
  https://github.com/couchbase/goxdcr/raw/refs/heads/master/etc/metrics_metadata.json
  https://github.com/couchbase/indexing/raw/refs/heads/master/secondary/docs/metrics_metadata.json
  https://github.com/couchbase/query/raw/refs/heads/master/etc/metrics_metadata.json
  https://github.com/couchbase/cbft/raw/refs/heads/master/etc/metrics_metadata.json
  https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/backup_metrics_metadata.json
  https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/cbas_metrics_metadata.json
  https://github.com/couchbase/eventing/raw/refs/heads/master/parser/metrics_metadata.json
  https://github.com/couchbase/docs-sync-gateway/raw/refs/heads/release/3.2/modules/ROOT/assets/attachments/metrics_metadata.json

Automatically transform the JSON to match the format of metadata.csv:
  Object key → metric_name
  type → metric_type
  unit → unit_name
  help → description
  uiName → short_name
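
The field mapping above can be sketched as follows. This assumes each JSON file is a top-level object keyed by metric name whose values are objects holding `type`, `unit`, `help`, and `uiName`; the function names are hypothetical. It emits only the five mapped columns, whereas the real metadata.csv has further columns (such as per_unit and orientation) that were curated by hand.

```python
import csv
import io
import json

# JSON field -> metadata.csv column, as listed above.
FIELD_MAP = {
    "type": "metric_type",
    "unit": "unit_name",
    "help": "description",
    "uiName": "short_name",
}


def metadata_json_to_rows(json_text):
    """Turn one Couchbase metrics_metadata.json document into CSV row dicts."""
    metrics = json.loads(json_text)
    rows = []
    for metric_name, attrs in sorted(metrics.items()):
        row = {"metric_name": metric_name}  # object key -> metric_name
        for src, dst in FIELD_MAP.items():
            row[dst] = attrs.get(src, "")  # missing fields are left empty
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the row dicts with a header line."""
    columns = ["metric_name"] + list(FIELD_MAP.values())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample input, for illustration only.
sample = '{"kv_dcp_backoff": {"type": "gauge", "help": "Example help text", "uiName": "DCP backoff"}}'
print(rows_to_csv(metadata_json_to_rows(sample)))
```
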

Manual curation was performed to fix metric types, units, per_unit, and
orientation fields where Couchbase metadata was incorrect or missing. This was
labor-intensive and took many hours.

Of all the new metrics, 21 were already present in the old metadata. Due
to updated description fields, the old entries were removed and replaced
with the updated versions.

Various metrics with suspected faulty or legacy names were renamed:
  audit_queue_length → cm_audit_queue_length
  audit_unsuccessful_retries → cm_audit_unsuccessful_retries
  couch_* → kv_couch_*
  total_knn_queries_rejected_by_throttler → fts_num_knn_queries_rejected_by_throttler

Around 80 metrics are returned by the Prometheus endpoint but are not
documented anywhere, not even in the Couchbase source code. These are kept
with the description "N/A" rather than dropped, so that potentially useful
data is not lost, however unlikely it is to matter.

Correct metric type classifications for index metrics in the legacy REST API
check.

`items_count`: This is a gauge, contrary to what the name seems to indicate.

`total_scan_duration`: This is a counter (represents cumulative scan time).

These corrections ensure proper metric typing when using the legacy REST API
collection method (`server` configuration option).

Introduce COUCHBASE_METRIC_SOURCE environment variable to enable testing
both legacy REST API and Prometheus collection methods.
Contributor

@OliviaShoup OliviaShoup left a comment

Hey @sveniu, thanks so much for contributing to the docs! I've made a Jira ticket for someone on the docs team to review this PR. We'll get to it ASAP. Thanks again!

@codecov

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 76.98413% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.00%. Comparing base (8653d93) to head (f804962).
⚠️ Report is 1 commits behind head on master.

@kayayarai kayayarai requested a review from a team October 30, 2025 20:35