Add support for Prometheus metrics #21780
base: master
Add support for collecting metrics from Couchbase's Prometheus/OpenMetrics endpoint as an alternative to the legacy REST API. The Prometheus metrics are much more comprehensive than those from the REST API and give users greater insight into how their Couchbase clusters are performing.

The `prometheus_url` configuration option allows users to specify the Prometheus endpoint (default: http://localhost:8091/metrics for Couchbase 7.0+). The `server` option remains for backward compatibility with the legacy REST API.

This change is backward compatible: existing configurations using `server` continue to work with the REST API without modification.
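To make the two configuration styles concrete, here is a minimal sketch of both instance configs written as Python dictionaries. Only the `server` and `prometheus_url` option names and the default URL come from the commit message; the `pick_source` helper, its return values, and the precedence when both options are set are assumptions for illustration, not the integration's actual code.

```python
# Hypothetical helper (not part of the integration): reports which collection
# path an instance config would select under the behaviour described above.
def pick_source(instance: dict) -> str:
    # Assumption: `prometheus_url` wins if both options are present.
    if "prometheus_url" in instance:
        return "prometheus"
    if "server" in instance:
        return "rest"
    raise ValueError("instance needs either `prometheus_url` or `server`")


# Existing configuration: keeps using the legacy REST API unchanged.
legacy_instance = {"server": "http://localhost:8091"}

# New configuration: opts in to the Prometheus/OpenMetrics endpoint.
prometheus_instance = {"prometheus_url": "http://localhost:8091/metrics"}

print(pick_source(legacy_instance), pick_source(prometheus_instance))
```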
Use the legacy OpenMetricsBaseCheck since it's more lenient with malformed data, which Couchbase's endpoint produces.

Example: Many Couchbase metrics lack proper Prometheus TYPE declarations. OpenMetricsBaseCheck handles this better by allowing explicit type overrides, whereas OpenMetricsBaseCheckV2 is stricter about requiring proper TYPE annotations.

Example: OpenMetricsBaseCheck provides better control over histogram output format. We use:
- send_histograms_buckets=True for traditional histogram format
- Separate .sum, .count, and .bucket metrics (easier to work with in Datadog)
- V2 defaults to distribution metrics which may not match existing dashboards

The legacy check's type_overrides parameter works seamlessly with our metric transformation utilities, allowing us to fix Couchbase's incorrect/missing type metadata systematically.

The legacy OpenMetricsBaseCheck is also used in other integrations such as HAProxy and Cilium.
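As a rough sketch of how these options fit together, the function below assembles an instance dictionary of the kind the legacy OpenMetricsBaseCheck accepts. The option names (`prometheus_url`, `namespace`, `metrics`, `type_overrides`, `send_histograms_buckets`) are legacy OpenMetrics options; the `build_openmetrics_instance` helper, the `couchbase` namespace value, and the `gauge` override shown for `kv_dcp_backoff` are assumptions for illustration and may not match what the check actually does.

```python
# Hypothetical helper: the real check may wire these options differently.
def build_openmetrics_instance(prometheus_url, metric_map, type_overrides):
    return {
        "prometheus_url": prometheus_url,
        "namespace": "couchbase",  # assumed namespace
        # Raw Prometheus name -> Datadog name.
        "metrics": [metric_map],
        # Raw Prometheus name -> type, for metrics whose TYPE line is
        # missing or wrong.
        "type_overrides": type_overrides,
        # Emit .bucket metrics alongside .sum and .count.
        "send_histograms_buckets": True,
    }


instance = build_openmetrics_instance(
    "http://localhost:8091/metrics",
    metric_map={"kv_dcp_backoff": "kv.dcp_backoff"},
    # Illustrative override only; not taken from the actual type table.
    type_overrides={"kv_dcp_backoff": "gauge"},
)
```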
Provide functions to convert between Couchbase Prometheus metric names and Datadog metric names, including metric mapping and type overrides.

The transformation handles:
- Name conversion: `kv_dcp_backoff` → `kv.dcp_backoff`
- Legacy/misnamed metrics via RAW_METRIC_NAME_MAP
- Type overrides for metrics with missing/incorrect Prometheus TYPE metadata
- Histogram metric handling (_bucket, _count, _sum suffixes)

These utilities are used by CouchbaseCheckV2 to configure the OpenMetrics metric map and type overrides, ensuring proper metric collection and typing.
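The name handling described above could look roughly like the sketch below. Only the `kv_dcp_backoff` → `kv.dcp_backoff` example, the `RAW_METRIC_NAME_MAP` name, and the three histogram suffixes come from the commit message; the function names, the map entry, and the "split on the first underscore" rule are assumptions, and the actual utilities may work differently.

```python
HISTOGRAM_SUFFIXES = ("_bucket", "_count", "_sum")

# Illustrative entry only; the real contents of RAW_METRIC_NAME_MAP are not
# shown here.
RAW_METRIC_NAME_MAP = {
    "audit_queue_length": "cm_audit_queue_length",
}


def to_datadog_name(raw_name: str) -> str:
    """Map a raw Prometheus name to a dotted Datadog name (assumed rule)."""
    name = RAW_METRIC_NAME_MAP.get(raw_name, raw_name)
    # Assumption: the service prefix is everything before the first underscore.
    service, sep, rest = name.partition("_")
    return f"{service}.{rest}" if sep else name


def split_histogram_suffix(raw_name: str) -> tuple[str, str]:
    """Return (base_name, suffix); suffix is '' when none matches.

    A name-only test cannot tell a histogram series from a plain metric that
    happens to end in `_count`, so a real implementation would also need the
    metric's type.
    """
    for suffix in HISTOGRAM_SUFFIXES:
        if raw_name.endswith(suffix):
            return raw_name[: -len(suffix)], suffix
    return raw_name, ""


print(to_datadog_name("kv_dcp_backoff"))
```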
Add comprehensive Prometheus metrics metadata curated from official Couchbase sources. This establishes metadata.csv as the authoritative source for all Couchbase metric definitions (1,841 total metrics).

The following was done to produce the new metadata:

Download Couchbase's metric metadata from these URLs:
- https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/cm_metrics_metadata.json
- https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/kv_metrics_metadata.json
- https://github.com/couchbase/ns_server/raw/refs/heads/master/etc/metrics_metadata.json
- https://github.com/couchbase/goxdcr/raw/refs/heads/master/etc/metrics_metadata.json
- https://github.com/couchbase/indexing/raw/refs/heads/master/secondary/docs/metrics_metadata.json
- https://github.com/couchbase/query/raw/refs/heads/master/etc/metrics_metadata.json
- https://github.com/couchbase/cbft/raw/refs/heads/master/etc/metrics_metadata.json
- https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/backup_metrics_metadata.json
- https://github.com/couchbase/docs-server/raw/refs/heads/release/7.6/modules/metrics-reference/attachments/cbas_metrics_metadata.json
- https://github.com/couchbase/eventing/raw/refs/heads/master/parser/metrics_metadata.json
- https://github.com/couchbase/docs-sync-gateway/raw/refs/heads/release/3.2/modules/ROOT/assets/attachments/metrics_metadata.json

Automatically transform the JSON to match the format of metadata.csv:
- Object key → metric_name
- type → metric_type
- unit → unit_name
- help → description
- uiName → short_name

Manual curation was performed to fix metric types, units, per_unit, and orientation fields where Couchbase metadata was incorrect or missing. This was labor-intensive and took many hours.

Of all the new metrics, 21 were already present in the old metadata. Due to updated description fields, the old entries were removed and replaced with the updated versions.

Various metrics with suspected faulty or legacy names were renamed:
- audit_queue_length → cm_audit_queue_length
- audit_unsuccessful_retries → cm_audit_unsuccessful_retries
- couch_* → kv_couch_*
- total_knn_queries_rejected_by_throttler → fts_num_knn_queries_rejected_by_throttler

Around 80 metrics are returned by the Prometheus endpoint but are not documented anywhere, not even in the Couchbase source code. These are kept with the description "N/A" rather than dropped, to avoid losing potentially interesting data, however unlikely.
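The automatic JSON-to-CSV step can be sketched as follows. The field mapping is the one listed in the commit message; the function names, the sample input, and the "N/A" fallback for a missing `help` field are assumptions for illustration. The sketch emits only the mapped columns, whereas the real metadata.csv has additional columns that were filled in during manual curation.

```python
import csv
import io

# Source JSON field -> metadata.csv column, as listed in the commit message.
FIELD_MAP = {
    "type": "metric_type",
    "unit": "unit_name",
    "help": "description",
    "uiName": "short_name",
}
COLUMNS = ["metric_name", *FIELD_MAP.values()]


def to_rows(metadata: dict) -> list[dict]:
    """Turn {metric_name: {field: value}} into one CSV row per metric."""
    rows = []
    for name, attrs in sorted(metadata.items()):
        row = {"metric_name": name}  # object key -> metric_name
        for src, dst in FIELD_MAP.items():
            row[dst] = attrs.get(src, "")
        # Assumption: undocumented metrics get the placeholder description.
        row["description"] = row["description"] or "N/A"
        rows.append(row)
    return rows


def to_csv(rows: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample input; field values are not taken from Couchbase's files.
sample = {
    "kv_dcp_backoff": {"type": "gauge", "help": "Example help text", "uiName": "DCP Backoff"},
    "kv_undocumented_metric": {},
}
rows = to_rows(sample)
print(to_csv(rows))
```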
Correct metric type classifications for index metrics in the legacy REST API check.

- `items_count`: This is a gauge, contrary to what the name seems to indicate.
- `total_scan_duration`: This is a counter (represents cumulative scan time).

These corrections ensure proper metric typing when using the legacy REST API collection method (`server` configuration option).
Introduce COUCHBASE_METRIC_SOURCE environment variable to enable testing both legacy REST API and Prometheus collection methods.
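A test harness might read the variable along these lines. Only the variable name `COUCHBASE_METRIC_SOURCE` comes from the commit message; the accepted values (`rest`, `prometheus`), the default, and the `metric_source` helper are assumptions, since the message does not state them.

```python
import os

# Assumed values; the commit message does not list the accepted ones.
VALID_SOURCES = {"rest", "prometheus"}


def metric_source(environ=os.environ) -> str:
    """Return the collection method the tests should exercise."""
    # Assumption: default to the legacy REST API when the variable is unset.
    source = environ.get("COUCHBASE_METRIC_SOURCE", "rest").lower()
    if source not in VALID_SOURCES:
        raise ValueError(f"unsupported COUCHBASE_METRIC_SOURCE: {source!r}")
    return source
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.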
Hey @sveniu, thanks so much for contributing to the docs! I've made a Jira ticket for someone on the docs team to review this PR. We'll get to it ASAP. Thanks again!
What does this PR do?
Add support for collecting metrics from Couchbase's Prometheus endpoint as an alternative to the legacy REST API.
The Prometheus endpoint exposes significantly more comprehensive metrics than the legacy REST API: the total metric count increases from ~400 to ~1,900.
End users will get greatly improved insight into the performance of their Couchbase clusters.
Motivation
The legacy REST metrics are too limited for running production-critical workloads on Couchbase. To troubleshoot complex issues, operators often have to use the Couchbase admin UI to access metrics that provide useful diagnostic insight. Users with access to Couchbase Support are familiar with the pattern: troubleshooting an issue, collecting cluster logs, uploading to Couchbase Support, and having Support recreate various metrics that pinpoint the problem. Those diagnostic metrics are rarely available through the legacy REST endpoint.
By enabling the comprehensive Prometheus metrics in Datadog, operators gain cluster visibility comparable to that of Couchbase Support, significantly reducing troubleshooting time and effort.
The metric count increase provides Datadog users with a substantially more useful integration.
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- Add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged