The Failures page groups traces by failure signature, surfaces regressions, and lets you configure alerts so you find out about incidents before your users do.
Cluster list
Each row is a failure cluster — a group of traces that share the same normalised error signature (error type, message skeleton, and top contributing span). Columns:
- Signature — human-readable cluster label (e.g. `timeout in llm:openai.chat.completions`).
- Count — traces in the cluster within the selected window.
- Trend — sparkline of cluster volume, last 24 h.
- First / last seen — helpful for spotting regressions tied to a deploy.
- Impact — unique sessions and unique users affected.
- Status — `new`, `acknowledged`, or `resolved`.
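To make the grouping idea concrete, here is a minimal sketch of how traces could be bucketed by a normalised signature. The field names (`error_type`, `error_message`, `origin_span`, `trace_id`) and the skeleton rules are assumptions for illustration; the product's actual normalisation is not documented here and may differ.

```python
import re
from collections import defaultdict


def message_skeleton(message: str) -> str:
    """Replace variable parts of an error message with placeholders (illustrative rules)."""
    # Hex literals and decimal numbers become <num>.
    skeleton = re.sub(r"0x[0-9a-fA-F]+|\d+(?:\.\d+)?", "<num>", message)
    # Quoted strings become <str>.
    skeleton = re.sub(r"'[^']*'|\"[^\"]*\"", "<str>", skeleton)
    return skeleton.strip().lower()


def signature(trace: dict) -> tuple:
    """Build the (error type, message skeleton, origin span) key for one trace."""
    return (
        trace["error_type"],
        message_skeleton(trace["error_message"]),
        trace["origin_span"],
    )


def cluster_traces(traces: list) -> dict:
    """Group trace IDs by signature; each key corresponds to one cluster row."""
    clusters = defaultdict(list)
    for trace in traces:
        clusters[signature(trace)].append(trace["trace_id"])
    return dict(clusters)
```

With this sketch, two timeouts that differ only in the reported duration land in the same cluster, while an error with a different type or origin span starts a new one.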
Cluster detail
Click any cluster to open the root-cause view.
- Top contributing spans — the 3–5 span names most frequently marked as the failure origin across traces in the cluster, with counts and average latency.
- Representative error messages — de-duplicated error strings with per-variant counts. Click to pivot to a matching trace.
- Linked traces — paginated list of every trace in the cluster; click through to the trace detail and span waterfall.
- Feedback overlay — any negative user feedback attached to cluster traces shows up here for context.
Filter by `status = resolved` to see resolved clusters.
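The "top contributing spans" summary can be pictured as a count-and-average aggregation over the cluster's traces. The sketch below assumes each trace record carries a hypothetical `origin_span` name and `origin_latency_ms` value; it is an illustration of the aggregation, not the product's implementation.

```python
from collections import defaultdict


def top_contributing_spans(traces: list, limit: int = 5) -> list:
    """Return the most frequent failure-origin spans with counts and average latency."""
    totals = defaultdict(lambda: [0, 0.0])  # span name -> [count, summed latency in ms]
    for trace in traces:
        entry = totals[trace["origin_span"]]
        entry[0] += 1
        entry[1] += trace["origin_latency_ms"]

    rows = [
        {"span": name, "count": count, "avg_latency_ms": total / count}
        for name, (count, total) in totals.items()
    ]
    # Most frequent first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row["count"], row["span"]))
    return rows[:limit]
```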
Alert rules
From Failures → Alert rules, create rules that fire on cluster or failure-rate conditions. Rule fields:
- Name — shown on the alert payload.
- Trigger — one of:
  - `failure_rate > threshold` over a rolling window (e.g. `> 2% over 5 minutes`)
  - `cluster_count > threshold` for a new or existing cluster
  - `cluster.first_seen` — fires the first time a signature appears
- Scope — project, environment, model, or metadata filter.
- Channel — webhook URL (JSON payload) or email recipients.
- Cooldown — suppress repeated fires within the interval (default 15 minutes).
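As a rough illustration of how a `failure_rate > threshold` trigger and a cooldown interact, the sketch below evaluates a rolling window in memory. The class name `FailureRateRule`, its fields, and the choice to express the threshold as a fraction and timestamps in seconds are all assumptions; the service evaluates rules on its side and its exact semantics may differ.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureRateRule:
    """Hypothetical in-memory model of a failure-rate alert rule."""
    threshold: float            # e.g. 0.02 for "> 2%"
    window_s: float = 300.0     # rolling window, e.g. 5 minutes
    cooldown_s: float = 900.0   # default cooldown of 15 minutes
    _events: deque = field(default_factory=deque, init=False)
    _last_fired: Optional[float] = field(default=None, init=False)

    def observe(self, ts: float, failed: bool) -> bool:
        """Record one trace at time ts; return True if the alert should fire now."""
        self._events.append((ts, failed))
        # Drop traces that have fallen out of the rolling window.
        while self._events and self._events[0][0] <= ts - self.window_s:
            self._events.popleft()

        failures = sum(1 for _, was_failure in self._events if was_failure)
        rate = failures / len(self._events)
        if rate <= self.threshold:
            return False
        # Suppress repeated fires inside the cooldown interval.
        if self._last_fired is not None and ts - self._last_fired < self.cooldown_s:
            return False
        self._last_fired = ts
        return True
```

In this model the condition can stay true for many consecutive traces, but only the first breach and any breach after the cooldown has elapsed produce a notification.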
Common workflows
- New deploy monitoring. Filter to `first_seen > deploy_time` to see clusters introduced by the latest release.
- Triage the weekly on-call. Sort clusters by Impact desc, work top-down.
- Close the loop with ownership. Add a metadata filter (`metadata.team = "payments"`) to alert rules so only the right team gets paged.
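The first two workflows amount to a filter and a sort over cluster rows. The sketch below shows that logic on plain dictionaries; the keys (`first_seen`, `unique_users`, `unique_sessions`) and the decision to rank impact by users first, then sessions, are assumptions made for illustration rather than documented behaviour.

```python
from datetime import datetime, timezone


def new_since_deploy(clusters: list, deploy_time: datetime) -> list:
    """Keep clusters whose signature first appeared after the deploy."""
    return [c for c in clusters if c["first_seen"] > deploy_time]


def by_impact_desc(clusters: list) -> list:
    """Order clusters by affected users, then affected sessions, highest first."""
    return sorted(
        clusters,
        key=lambda c: (c["unique_users"], c["unique_sessions"]),
        reverse=True,
    )


# Example: triage only the clusters introduced by a release at 12:00 UTC.
deploy_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
clusters = [
    {"signature": "old", "first_seen": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
     "unique_users": 50, "unique_sessions": 80},
    {"signature": "new-small", "first_seen": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
     "unique_users": 3, "unique_sessions": 4},
    {"signature": "new-big", "first_seen": datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
     "unique_users": 20, "unique_sessions": 35},
]
triage_queue = by_impact_desc(new_since_deploy(clusters, deploy_time))
```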