Is your proposal related to a problem?
Presently, the Thanos Query executor serves queries that consume data from Prometheus by making Series gRPC requests to Sidecar, which in turn issues remote-read requests against Prometheus's HTTP API to fetch the entire requested series.
This can be very expensive in network I/O, memory, serialization/deserialization CPU time and so on, because Thanos fetches whole series only for the PromQL executor to aggregate most of the data away and/or filter out most of the series with joins, operators and functions.
It is particularly problematic when the series are large, and particularly wasteful when the query output is a small subset or aggregate of the input.
For high-cardinality result sets it can cause OOMs in Sidecar, because Sidecar buffers and sorts the remote-read response from Prometheus to work around the Prometheus bug of returning unsorted series when external labels are present.
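For illustration, here is a rough sketch of what a Series fetch amounts to today: every raw chunk of every matching series is streamed back for the whole time range, regardless of how little of it the query ultimately needs. This is written from memory of the Store API in github.com/thanos-io/thanos/pkg/store/storepb, so treat the exact field names as approximate.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/thanos-io/thanos/pkg/store/storepb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// fetchAllRawSeries mirrors what Query effectively asks Sidecar for today:
// every raw chunk of every matching series over the whole time range, even if
// the PromQL executor will immediately aggregate or filter most of it away.
func fetchAllRawSeries(ctx context.Context, sidecarAddr, metric string, start, end time.Time) error {
	conn, err := grpc.Dial(sidecarAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	stream, err := storepb.NewStoreClient(conn).Series(ctx, &storepb.SeriesRequest{
		MinTime:  start.UnixMilli(),
		MaxTime:  end.UnixMilli(),
		Matchers: []storepb.LabelMatcher{{Type: storepb.LabelMatcher_EQ, Name: "__name__", Value: metric}},
	})
	if err != nil {
		return err
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if s := resp.GetSeries(); s != nil {
			// All of this crosses the wire and is buffered before the executor
			// discards most of it.
			fmt.Println(len(s.Chunks), "chunks for", s.Labels)
		}
	}
}
```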
Describe the solution you'd like
Thanos Query in distributed query mode should support using the remote engine to push down PromQL queries, or partial queries, directly to Prometheus for execution, and then consume the result.
A proof of concept could be done by implementing PrometheusRequest in Sidecar and allowing Query in distributed mode to consider Sidecar a valid downstream for push-down.
If support for a gRPC Query protocol is later added directly to Prometheus, this could pivot easily to executing the PromQL directly in Prometheus, bypassing Sidecar entirely.
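As a very rough sketch of the Sidecar side of such a proof of concept (the surrounding gRPC plumbing is omitted and the handler shape is hypothetical), the pushed-down expression would simply be evaluated via Prometheus's HTTP query API using the official client, so only the evaluated result travels back rather than raw series:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// evalOnPrometheus is a hypothetical handler body for a pushed-down instant
// query: instead of streaming raw series back to Query, Sidecar asks Prometheus
// to evaluate the (partial) PromQL expression and returns only the result.
func evalOnPrometheus(ctx context.Context, promURL, expr string, ts time.Time) error {
	client, err := api.NewClient(api.Config{Address: promURL})
	if err != nil {
		return err
	}
	v1api := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// Prometheus evaluates the expression with its own engine; only the
	// resulting samples come back, not the underlying series.
	result, warnings, err := v1api.Query(ctx, expr, ts)
	if err != nil {
		return err
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // would be marshalled into the gRPC query response instead
	return nil
}
```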
(Per discussion with @MichaHoffmann and some others in CNCF Slack.)
Describe alternatives you've considered
The current alternative involves running a Query instance in front of Sidecar, which in turn sits in front of Prometheus. That Query instance serves PrometheusRequest for an upstream distributed-mode Query instance. This works, but it interposes yet more components that each have significant container memory requirements, adds latency, and adds more serialization/deserialization passes. It is particularly problematic for high-cardinality results and/or range requests with large numbers of steps, because every component must be sized for the worst-case memory requirements of any likely query to avoid OOMs and service disruptions, and/or must run with multiple instances at, again, more memory overhead.
I've looked at ways to improve Sidecar's memory overhead by improving the series fetching and sorting, but it's not a simple problem given how the components interact. And in the end, it would still fetch the whole series.
I've also briefly looked at using series fetch hints to reduce the labels returned by Sidecar to Query, so that the Thanos executor can tell Sidecar to omit labels it knows it won't need on result series. This would reduce bandwidth and memory use where series carry many wide labels that the executor would promptly discard anyway. It would help, but not as much.
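For illustration, a minimal sketch of that label-projection idea, assuming grouping hints could be propagated down to Sidecar. It uses Prometheus's storage.SelectHints type, which really does carry Func/Grouping/By for a select; the wiring of those hints into the Series gRPC request is hypothetical.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

// projectLabels sketches the idea: if the select hints say the result will be
// grouped "by (some_label)", Sidecar could drop every other label before
// sending the series to Query, since the executor would discard them anyway.
func projectLabels(hints *storage.SelectHints, lset labels.Labels) labels.Labels {
	if hints == nil || !hints.By || len(hints.Grouping) == 0 {
		return lset // no safe projection known, keep everything
	}
	b := labels.NewBuilder(labels.EmptyLabels())
	for _, name := range hints.Grouping {
		if v := lset.Get(name); v != "" {
			b.Set(name, v)
		}
	}
	return b.Labels()
}

func main() {
	hints := &storage.SelectHints{Func: "max", By: true, Grouping: []string{"some_label"}}
	in := labels.FromStrings("__name__", "some_series", "some_label", "a", "pod", "p-123", "instance", "10.0.0.1:9100")
	fmt.Println(projectLabels(hints, in)) // only some_label survives
}
```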
Potential problems
Different engines for different parts of one query
Thanos Query in distributed mode will use the Thanos PromQL engine, but Prometheus inherently uses the native Prometheus engine.
So a PromQL push-down from distributed Query to Prometheus could execute part of the PromQL with the Thanos engine and other parts with the Prometheus engine. Or the whole PromQL expression could be pushed down to Prometheus and executed with the Prometheus engine alone.
This would break queries that use Thanos-specific engine extensions, and might have surprising results in edge cases where the two executors differ in behaviour.
This feature would not be available when Thanos extension PromQL functions are used in a query; Thanos Query would need to fall back to Series gRPC execution instead, or return an error if the operator has disabled that fallback.
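A minimal sketch of how the planner could make that decision, by checking every function call in the parsed expression against the stock Prometheus function table (parser.Functions). The helper name is illustrative, and the expression here is assumed to have been parsed by whatever parser the distributed engine uses.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// pushDownSafe walks an already-parsed expression and reports whether every
// function call is one that stock Prometheus knows about, so the whole
// expression could be evaluated by the Prometheus engine.
func pushDownSafe(expr parser.Expr) bool {
	safe := true
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if call, ok := node.(*parser.Call); ok {
			if _, known := parser.Functions[call.Func.Name]; !known {
				safe = false
			}
		}
		return nil
	})
	return safe
}

func main() {
	// With the stock parser this example can only contain standard functions;
	// a Thanos extension function would already fail to parse here, which is
	// itself a usable signal that push-down is not possible.
	expr, err := parser.ParseExpr(`sum by (job) (rate(http_requests_total[5m]))`)
	if err != nil {
		panic(err)
	}
	fmt.Println(pushDownSafe(expr)) // true
}
```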
Different lookback deltas
Prometheus configures its lookback delta globally via --query.lookback-delta. If a query pushed down to Prometheus used a different lookback delta from the one used in the Thanos engine, this could cause subtly wrong and confusing results.
Thankfully Prometheus supports per-query lookback delta configuration (though it doesn't appear to be documented in their HTTP API docs), so Query can simply pass the query's lookback delta down to Prometheus.
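A minimal sketch of passing it through on the HTTP call, assuming the undocumented lookback_delta form parameter that Prometheus's query handlers read; the value is sent as seconds, which the API accepts for duration parameters.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

// instantQuery runs an instant query against Prometheus, overriding the
// engine's default lookback delta for just this query so it matches the value
// the Thanos engine would have used. lookback_delta is read by Prometheus's
// /api/v1/query handler but is not part of the documented API surface.
func instantQuery(promURL, expr string, ts time.Time, lookback time.Duration) (string, error) {
	params := url.Values{}
	params.Set("query", expr)
	params.Set("time", strconv.FormatInt(ts.Unix(), 10))
	params.Set("lookback_delta", strconv.FormatFloat(lookback.Seconds(), 'f', -1, 64))

	resp, err := http.PostForm(promURL+"/api/v1/query", params)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	out, err := instantQuery("http://localhost:9090", "up", time.Now(), 2*time.Minute)
	fmt.Println(out, err)
}
```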
Range vector selectors overlapping the Prometheus <-> object storage boundary
Queries could contain range-vector selectors that select data not fully within Prometheus's TSDB retention limits. Such a query cannot be executed correctly by pushing down the PromQL to Prometheus.
For example, Prometheus might retain only 1 day of data locally, but the operator runs a range query that does label matching on `max by (some_label) (some_series{}[7d] @ end())` to select which series to include in the result. This works in normal distributed Query mode by pushing down to another Query instance, which uses Series gRPC to fetch the whole series from a combination of Prometheus (via Sidecar), Thanos Store and, if required, Thanos Receive, integrates the whole series in memory, then executes the query on it. It cannot be executed by pushing down the PromQL to each of Prometheus, Store and Receive and then integrating the resulting responses.
Query already supports tracking the time ranges covered by each source, so it can detect these cases. If a query that cannot be pushed down is detected, the distributed Query instance could:
- Fall back to Series gRPC execution (possibly then failing if it exceeds configured limits); or
- Reject the query and report an error to the user
... depending on how the operator configures the Query instance.
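A minimal sketch of that detection, assuming the planner knows the minimum timestamp each endpoint can serve (which Query already tracks): walk the selectors to find the oldest sample the query can touch and compare it against that minimum. Function names are illustrative, and subqueries plus @ start()/@ end() resolution are omitted for brevity.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/promql/parser"
)

// earliestSampleNeeded returns the oldest sample timestamp a query can touch,
// so the planner can compare it with the minimum time a source (for example
// Prometheus's local TSDB retention) can serve before deciding to push down.
func earliestSampleNeeded(query string, start time.Time, lookback time.Duration) (time.Time, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return time.Time{}, err
	}
	earliest := start
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		switch n := node.(type) {
		case *parser.MatrixSelector:
			vs := n.VectorSelector.(*parser.VectorSelector)
			earliest = minTime(earliest, selectorStart(start, vs, n.Range))
		case *parser.VectorSelector:
			earliest = minTime(earliest, selectorStart(start, n, lookback))
		}
		return nil
	})
	return earliest, nil
}

// selectorStart computes the oldest timestamp one selector needs, accounting
// for an explicit @ timestamp, the offset, and the range or lookback window.
func selectorStart(evalTime time.Time, vs *parser.VectorSelector, window time.Duration) time.Time {
	t := evalTime
	if vs.Timestamp != nil { // explicit @ modifier pins the evaluation time
		t = time.UnixMilli(*vs.Timestamp)
	}
	return t.Add(-vs.OriginalOffset).Add(-window)
}

func minTime(a, b time.Time) time.Time {
	if b.Before(a) {
		return b
	}
	return a
}

func main() {
	// A 7d range selector evaluated now needs data from a week back; if the
	// Prometheus endpoint only retains 1 day locally, push-down must be skipped.
	e, _ := earliestSampleNeeded(`max by (some_label) (some_series{}[7d])`, time.Now(), 5*time.Minute)
	fmt.Println("oldest sample needed:", e)
}
```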
Additional context
For maximum benefit, this would be used in conjunction with:
- Streaming Result for Query/QueryRange HTTP API prometheus/prometheus#10040
- Add iterator-returning variant of promql.Query engine interface prometheus/prometheus#17276, plus an implementation of it for the Thanos PromQL engine
- Add sidecar flag bypass Prometheus response buffering and re-sorting #8487 (to improve Series gRPC fallback)