Description
Bug Report
Approximately one hour after fluentbit starts, all fluentbit_ internal metrics stop being written by the prometheus_remote_write output. These are all of the metrics produced by the fluentbit_metrics input. This continues indefinitely until fluentbit is restarted; the existing process never resumes sending these metrics.
Metrics from any other metric-producing inputs, such as prometheus_scrape and prometheus_textfile, continue to be sent normally. Also, if a prometheus_exporter output is configured, the fluentbit_metrics metrics are still exported there (a sample output for that cross-check is shown below).
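An output along these lines is enough for that cross-check (a sketch; host/port are the plugin's documented defaults, not my exact config):

  outputs:
    - name: prometheus_exporter
      match: 'metrics_*'
      # exposes matched metrics over HTTP, by default at http://0.0.0.0:2021/metrics
      host: 127.0.0.1
      port: 2021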
To Reproduce
I can reproduce this with a minimal configuration running on my local MacBook.
After starting Victoria Metrics listening on localhost:8428, I run fluent-bit with this config:
---
service:
  flush: 1
  daemon: Off
  log_level: debug
  # Enable/Disable the built-in HTTP Server for metrics
  http_server: Off
  http_listen: 127.0.0.1
  http_port: 2020
pipeline:
  inputs:
    - name: fluentbit_metrics
      tag: metrics_fluentbit
      scrape_interval: 60s
  outputs:
    - name: prometheus_remote_write
      match: 'metrics_*'
      host: localhost
      port: 8428
      uri: /api/v1/write
      retry_limit: 2
      log_response_payload: True
      tls: Off
      add_label: job fluentbit2
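For completeness, the reproduction amounts to something like this (a sketch; the binary names and the config filename are assumptions about the local setup):

# Start a single-node Victoria Metrics instance on its default port
victoria-metrics -httpListenAddr=127.0.0.1:8428 &

# Run fluent-bit with the YAML config above, saved as fluent-bit.yaml
fluent-bit -c fluent-bit.yaml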
Metrics such as fluentbit_output_upstream_total_connections and fluentbit_build_info begin appearing immediately, but cease after approximately one hour. After that point, fluentbit continues to log that it is sending prometheus remote writes, and continues to log HTTP status=204 and FLB_OK, but those metrics no longer arrive.
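The cutoff is easy to confirm by querying Victoria Metrics directly, using the Prometheus-compatible query API it serves on the same port; for example:

# Returns the most recent sample while ingestion works, then comes back empty
curl -s 'http://localhost:8428/api/v1/query?query=fluentbit_build_info'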
If I add an additional input with any other metrics, those metrics continue to be sent. For example, I created a file /tmp/node_info.prom with a single static metric (a sample file is shown after the input snippet below) and added this input to the config:
    - name: prometheus_textfile
      tag: metrics_textfile
      path: /tmp/node_info.prom
      scrape_interval: 60s
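A file along these lines works as /tmp/node_info.prom; it is a single static gauge in the Prometheus text exposition format (the metric name and label here are arbitrary):

# TYPE node_info gauge
node_info{source="textfile_test"} 1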
After the fluentbit_ metrics ceased, this one additional metric continued to be sent for as long as the fluentbit process ran, which was more than a day in a couple of my tests.
Your Environment
- Version used: 4.0.3, 4.0.8, 4.1.1 (I reproduced with the same minimal config in all three of these versions)
- Configuration: See above
- Server type and version: MacBook Pro and AWS EC2
- Operating System and version: macOS Sequoia 15.7.1 and Amazon Linux 2023
Additional context
We started observing this issue earlier this month. We use the fluentbit_ metrics to monitor fluentbit and alert on problems, but we can no longer do so because these metrics are not being sent consistently.