Tempo: Fixing the Stuck Grafana Query Button When Accessing Recent Data
Ever been deep in your system's traces, trying to pinpoint a recent issue, only to have the Grafana query button get stuck in an endless loading loop? It's a frustrating experience, especially when you're querying the most recent 30 minutes of data in Tempo. You click the button expecting immediate insights and instead get a spinner that won't quit. The issue seems to hit only queries within that recent 30-minute window, while queries over older data complete without a hitch. If you've encountered this, you're not alone, and there are ways to troubleshoot and resolve it.
Understanding the "Stuck Button" Phenomenon in Tempo
The core of this issue often lies in how Tempo handles queries, particularly when searching for very recent data. When you query data older than 30 minutes, Tempo might be able to leverage optimizations or a different data access path. However, when you look at the most recent 30 minutes, Tempo needs to actively search and aggregate data that might still be in the process of being ingested or finalized. This can involve hitting different parts of the system, potentially leading to bottlenecks or timeouts. The Grafana query button, in this scenario, is essentially waiting for a response that either never comes, comes too late, or gets caught in a loop of trying to fetch and process data that's in a transitional state. This is particularly noticeable when switching between Grafana's "Search" and "TraceQL" tabs, suggesting that the way these different query interfaces interact with Tempo's backend under high-demand, recent-data scenarios is a key factor. Your configuration shows Tempo version 2.8.1 running in a scalable-single-binary mode with a two-node memberlist cluster. While this setup is generally robust, the specific interaction with recent data queries in Grafana points towards potential inefficiencies or misconfigurations in how Tempo's query frontend and ingesters are handling this time-sensitive data.
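The 30-minute boundary is not arbitrary: the split between ingester and backend queries is controlled by two query_frontend.search settings. As a rough sketch, assuming Tempo 2.x defaults (verify against the docs for your exact version), the relevant windows look like this:

query_frontend:
  search:
    # Traces newer than this (measured back from now) are searched in the ingesters.
    query_ingesters_until: 30m
    # Traces older than this are searched in the backend (your Azure blocks).
    query_backend_after: 15m

With these defaults, any query touching the last 30 minutes has to involve the ingesters, which matches exactly the window in which the button gets stuck.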
Deconstructing the Configuration: Key Parameters and Their Impact
Let's dive into your Tempo configuration to see where potential optimizations can be made. The provided configuration is quite extensive, covering everything from network timeouts to storage details. When troubleshooting query performance, especially for recent data, several sections are particularly relevant:
Server and Network Timeouts
server:
  http_listen_port: 3200
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 107374182400
  grpc_server_max_send_msg_size: 107374182400
  http_server_read_timeout: 777s
  http_server_write_timeout: 600s
  log_level: debug
The http_server_read_timeout and http_server_write_timeout values are quite generous, which helps prevent premature timeouts on large or slow requests. However, if the underlying query takes far longer than expected because of how recent data is served, these long timeouts simply mask the problem: the UI appears to hang instead of returning an error. The grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size values are set very high (100 GiB), which is generally fine unless a network intermediary limits message sizes.
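If you prefer the UI to fail fast with an error rather than spin indefinitely, one option is to keep these timeouts in line with the querier's search timeout instead of exceeding it. This is only a sketch with illustrative values, not a recommendation from the Tempo docs:

server:
  # Illustrative: match the querier's 600s search timeout rather than exceeding it.
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  # Switch back from debug once the investigation is done; debug logging adds overhead.
  log_level: info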
Query Frontend Settings
query_frontend:
  search:
    default_result_limit: 100
    max_duration: 720h
    duration_slo: 5s
    throughput_bytes_slo: 1073741824
    metadata_slo:
      duration_slo: 5s
      throughput_bytes_slo: 1073741824
  trace_by_id:
    duration_slo: 5s
  metrics:
    max_duration: 720h
    concurrent_jobs: 100
  multi_tenant_queries_enabled: false
This section is crucial. The duration_slo (Service Level Objective) for search and metadata queries is set to an aggressive 5 seconds. In Tempo these SLO values feed the query frontend's SLO metrics rather than acting as hard timeouts, so missing them won't abort a query, but consistently missing them for recent data is a strong hint that those queries are genuinely slow. The query_frontend.search.query_ingesters_until parameter, which you mentioned trying to adjust, is particularly relevant here: set too low, recent data that hasn't reached the backend yet is simply not searched; set very high, the ingesters are dragged into more of the time range and queries can take much longer. The max_result_limit of 0 is also worth noting: 0 means unlimited, and a query that returns an unexpectedly large result set could contribute to slow responses.
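If "unlimited" is not the intent, one hedged option is to cap the result size explicitly; the 500 below is an arbitrary example value, not a documented recommendation:

query_frontend:
  search:
    default_result_limit: 100
    # 0 means no cap; an explicit limit bounds the worst-case result set (example value).
    max_result_limit: 500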
Querier Settings
querier:
  frontend_worker:
    frontend_address: ${HOSTNAME}:9095
    grpc_client_config:
      max_recv_msg_size: 10737418240
      max_send_msg_size: 10737418240
  search:
    query_timeout: 600s
  max_concurrent_queries: 30
The querier.search.query_timeout is set to 600 seconds, which is substantial. This timeout applies to the querier's own work, not to the entire request lifecycle from Grafana. The frontend_worker.frontend_address correctly points the querier at the query frontend. max_concurrent_queries limits how many queries a querier processes simultaneously; if recent-data searches are complex and slow, they may hold these slots for extended periods.
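For reference, here is a minimal sketch of where these querier knobs sit, with an illustrative bump to max_concurrent_queries that only makes sense if resource monitoring shows free CPU and memory headroom:

querier:
  # Example value: raise only if the nodes have headroom and slots are being exhausted.
  max_concurrent_queries: 50
  search:
    query_timeout: 600s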
Storage Configuration
storage:
  trace:
    backend: azure
    azure:
      storage_account_name: <ACCOUNT>
      container_name: tempo
    blocklist_poll: 5m
    wal:
      path: /export/tempo_data/storage_wal
      version: vParquet3
    block:
      version: vParquet3
While less directly related to the loading issue, the efficiency of your Azure backend storage can indirectly impact query performance. If retrieving data blocks from Azure is slow, it will exacerbate any existing performance bottlenecks, especially for larger queries involving recent data. The blocklist_poll frequency might also play a role if there are many recently written blocks.
Memberlist Configuration
memberlist:
  randomize_node_name: false
  retransmit_factor: 2
  gossip_nodes: 2
  # ... other memberlist settings
  join_members:
    - <IP1>:7946
    - <IP2>:7946
For a two-node cluster, the memberlist configuration looks standard. Ensuring reliable communication between these nodes is vital for distributed tracing systems like Tempo. Any network instability or configuration issues here could lead to slower inter-node communication, affecting query aggregation.
query_frontend.search.query_ingesters_until - The Key Parameter?
Based on your description, query_frontend.search.query_ingesters_until is the most likely candidate for tuning. Despite how it sounds, this setting is not a wait time: it defines the window, measured back from now, for which the query frontend sends search requests to the ingesters rather than only to the backend. Its default is 30 minutes, which is exactly the boundary you're seeing: queries touching the last 30 minutes hit the ingesters, where traces may still be in memory or in recently flushed blocks, while older queries are served purely from backend blocks. Setting it too low risks missing recent data that hasn't reached the backend yet; setting it too high pulls the ingesters into more queries and more of the time range, increasing their load.
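If you do tune it, keep in mind (hedged, verify against your version's docs) that query_ingesters_until works together with query_backend_after, and the two windows should overlap so no part of the time range falls through the gap between ingesters and backend:

query_frontend:
  search:
    # Keep query_backend_after <= query_ingesters_until so the windows overlap
    # and every part of the time range is covered by at least one path.
    query_ingesters_until: 45m  # mirrors the example in the first troubleshooting step below
    query_backend_after: 15m    # assumed default; set it explicitly if you change the value above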
Troubleshooting Steps to Resolve the Stuck Query Button
Here’s a systematic approach to tackle the issue:
- Adjust query_frontend.search.query_ingesters_until: This is your primary suspect. Since the problem occurs for exactly the last 30 minutes, try increasing this value slightly beyond 30 minutes, for example to 45m or 60m, and observe whether the behavior changes; this widens the window that is searched in the ingesters. Be cautious not to set it excessively high, as that pulls the ingesters into more queries and could lead to very long query times and other timeouts.

  query_frontend:
    search:
      # ... other settings
      query_ingesters_until: 45m0s  # Example: increase from 30m

- Examine query_frontend.search.duration_slo: You have this set to 5s. This SLO only feeds the query frontend's SLO metrics and does not cut queries off, so raising it to 10s or 15s mainly stops recent queries from being counted as misses. Remember, SLOs are targets; consistently missing them indicates an underlying performance issue rather than a configuration problem.

  query_frontend:
    search:
      # ... other settings
      duration_slo: 10s  # Example: increase from 5s

- Review querier.search.query_timeout: The 600s timeout is generous, but make sure it covers the longest recent-data query you expect. If recent queries genuinely take longer than 10 minutes you might need to raise it, though it's unlikely to be the direct cause of the stuck button when older queries work fine.

- Check server.http_server_read_timeout and http_server_write_timeout: These are already high, but a specific network condition or a very large aggregation that exceeds them could still cause issues. The fact that older data works suggests this isn't the primary bottleneck.

- Monitor Resource Utilization: Keep an eye on CPU, memory, and network I/O on your Tempo nodes, especially the query frontend and ingesters, during periods when you experience the stuck button. High resource utilization could indicate that the system is struggling to keep up with the demand for recent data.

- Analyze Tempo Logs: Your provided logs show