
fluentbit_ metrics stop being sent to prometheus_remote_write output about 1 hour after start #11082

@g-cos

Description


Bug Report

Approximately one hour after fluentbit starts, all fluentbit_ internal metrics (i.e. everything produced by the fluentbit_metrics input) stop being included in what the prometheus_remote_write output writes. This continues indefinitely until fluentbit is restarted; the existing process never resumes sending these metrics.

Metrics from any other inputs that produce metrics, such as prometheus_scrape and prometheus_textfile, continue to be sent normally. Also, if a prometheus_exporter output is configured, the metrics from the fluentbit_metrics input are still exported there.
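For reference, a prometheus_exporter output for that cross-check would look roughly like this (a sketch; the listen address and port here are arbitrary choices):

    - name: prometheus_exporter
      match: 'metrics_*'
      host: 0.0.0.0
      port: 2021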

To Reproduce
I can reproduce this with a minimal configuration, running on my local MacBook.
After starting Victoria Metrics listening on localhost:8428, I run fluent-bit with this config:

---
service:
  flush: 1
  daemon: Off
  log_level: debug
  # Enable/Disable the built-in HTTP Server for metrics
  http_server: Off
  http_listen: 127.0.0.1
  http_port: 2020

pipeline:
  inputs:
    - name: fluentbit_metrics
      tag: metrics_fluentbit
      scrape_interval: 60s

  outputs:
    - name: prometheus_remote_write
      match: 'metrics_*'
      host: localhost
      port: 8428
      uri: /api/v1/write
      retry_limit: 2
      log_response_payload: True
      tls: Off
      add_label: job fluentbit2

Metrics such as fluentbit_output_upstream_total_connections and fluentbit_build_info begin appearing immediately, but cease after approximately one hour. After that point, fluentbit continues to log that it is sending prometheus remote writes, and continues to log HTTP status=204 and FLB_OK, but the fluentbit_ metrics no longer arrive at the remote write endpoint.
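One way to check whether the metrics are still arriving on the Victoria Metrics side is to query its Prometheus-compatible API directly, e.g. (a sketch; the 5m lookback window is an arbitrary choice):

    curl -s 'http://localhost:8428/api/v1/query' \
      --data-urlencode 'query=last_over_time(fluentbit_build_info{job="fluentbit2"}[5m])'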

If I add an additional input that produces any other metrics, those metrics continue to be sent. For example, I created a file /tmp/node_info.prom with a single static metric and added this input to the config:

    - name: prometheus_textfile
      tag: metrics_textfile
      path: /tmp/node_info.prom
      scrape_interval: 60s
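
The file itself contains a single static metric in the Prometheus text exposition format, roughly like this (illustrative; the exact metric name and value don't matter):

    # HELP node_info Static test metric
    # TYPE node_info gauge
    node_info{source="textfile"} 1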

After the fluentbit_ metrics ceased, this one additional metric continued to be sent for as long as the fluentbit process ran, which was more than a day in a couple of my tests.

Your Environment

  • Version used: 4.0.3, 4.0.8, 4.1.1 (I reproduced with the same minimal config in all three of these versions)
  • Configuration: See above
  • Server type and version: MacBook Pro, and AWS EC2
  • Operating System and version: macOS Sequoia 15.7.1 and Amazon Linux 2023

Additional context
We started observing this issue earlier this month. We use the fluentbit_ metrics to monitor fluentbit and to alert on problems, but we can no longer do so because these metrics are no longer being sent consistently.
