The Evals page is where you configure evaluators, run them against traces or datasets, and track regressions over time.
Sub-sections
- Results — every eval result your app has triggered, newest first.
- Datasets — curated trace collections used as fixed test inputs for regression. Open a dataset to see its runs and trigger new ones.
- Evaluators — the built-in and custom evaluators you can trigger.
Results
One row per eval result. Columns: trace ID, evaluator, metric, score, label, latency, timestamp. Click any row for the reasoning (LLM evals) or the rule output.
Filters
A filter bar above the results table narrows the list. All filter state lives in the URL, so any filtered view is shareable — bookmark it or paste the URL into a PR comment.
- Project — scope results to a single project in the current organization.
- Metric — restrict to one evaluator’s metric (options come from the public eval catalog at `GET /v1/eval-catalog`).
- Score range — `min` and `max` inputs accept a float between `0.0` and `1.0`. Values outside that range are ignored, and if `min > max` the conflicting value is dropped.
- Date range — last 1h / 24h / 7d / 30d presets that map to `from`/`to` timestamps.
- Clear — resets every active filter and removes each param from the URL.
Every filter maps to a query parameter on `GET /v1/eval`: `project_id`, `metric`, `score_min`, `score_max`, `from`, `to`.
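For example, a filtered pull against `GET /v1/eval` might look like the following sketch. The endpoint and its query parameters are documented above; the API host, bearer-token auth, and response shape are assumptions for illustration.

```python
# Minimal sketch: fetch a filtered page of eval results.
# The query params mirror the dashboard filters; the base URL,
# auth scheme, and "results" response key are assumptions.
import os
import requests

BASE_URL = "https://api.trulayer.ai"  # assumed API host

resp = requests.get(
    f"{BASE_URL}/v1/eval",
    headers={"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"},
    params={
        "project_id": "proj_123",        # scope to one project
        "metric": "hallucination",       # one evaluator's metric
        "score_min": 0.0,                # floats in [0.0, 1.0]
        "score_max": 0.5,
        "from": "2025-01-01T00:00:00Z",  # date-range timestamps
        "to": "2025-01-02T00:00:00Z",
    },
    timeout=30,
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row["trace_id"], row["metric"], row["score"])
```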
Export
The Export button at the top-right of the results table downloads the current filtered list as CSV or JSONL. Clicking it opens a small menu with:
- Include reasoning — when checked, the LLM-judge rationale column is included (truncated to 500 characters per row). Off by default because rationales can be long and multiline, which makes CSVs unwieldy.
- Download as CSV — `evals-YYYY-MM-DD.csv`.
- Download as JSONL — `evals-YYYY-MM-DD.jsonl`, one JSON object per line.
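A short sketch of working with the JSONL export, assuming the exported field names mirror the results-table columns (`trace_id`, `metric`, `score`); adjust to the actual export schema.

```python
# Minimal sketch: load a JSONL export and surface low-scoring rows.
# Field names are assumed from the results-table columns.
import json

with open("evals-2025-01-01.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

low = [r for r in rows if r.get("score", 1.0) < 0.5]
for r in low:
    print(f"{r['trace_id']}: {r['metric']} = {r['score']}")
```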
Datasets
A dataset is a named set of trace IDs. Create one by:
- Selecting traces from the Traces page and choosing Add to dataset
- Filtering in the Feedback page and pushing highly-rated or highly-disputed traces into a dataset
- Uploading a JSONL file via the dashboard or `POST /v1/datasets` (see the sketch after this list)
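A minimal sketch of the API path. The documented piece is the `POST /v1/datasets` endpoint; the host, auth scheme, and request-body shape (a name plus trace IDs) are assumptions, so consult the API reference for the actual contract.

```python
# Minimal sketch: create a dataset from a list of trace IDs via
# POST /v1/datasets. Body shape and auth are assumptions.
import os
import requests

resp = requests.post(
    "https://api.trulayer.ai/v1/datasets",  # assumed API host
    headers={"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"},
    json={
        "name": "checkout-regression-v1",
        "trace_ids": ["tr_abc123", "tr_def456", "tr_ghi789"],
    },
    timeout=30,
)
resp.raise_for_status()
print("created dataset:", resp.json().get("id"))
```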
Runs
Runs live inside each dataset’s detail page — open Datasets, pick a dataset, and the runs panel lists every batch executed against it. Click Run evaluators on a dataset to trigger an evaluator over every trace in it. The resulting run is a row in that panel showing:
- Dataset + evaluator + metric
- Completion status and progress
- Aggregate score (mean, median, histogram)
- Pass/fail ratio if the metric is categorical
Export a run
The run detail page has an Export button in the header that downloads the full run — metadata, per-item scores, and any LLM-judge reasoning — as a single JSON file named `eval-run-<id>.json`. Useful for archiving, sharing with reviewers, or diffing across runs outside the dashboard.
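For example, a rough diff between two exported runs might look like this, assuming each export carries an `items` list with `trace_id` and `score` fields (the layout inside `eval-run-<id>.json` is not specified here).

```python
# Minimal sketch: diff two exported runs to find traces whose score
# moved. The "items" list structure is an assumed JSON layout.
import json

def load_scores(path: str) -> dict[str, float]:
    with open(path) as f:
        run = json.load(f)
    return {item["trace_id"]: item["score"] for item in run["items"]}

before = load_scores("eval-run-abc.json")
after = load_scores("eval-run-def.json")

for trace_id in before.keys() & after.keys():
    delta = after[trace_id] - before[trace_id]
    if abs(delta) > 0.1:  # surface meaningful movement only
        print(f"{trace_id}: {before[trace_id]:.2f} -> {after[trace_id]:.2f} ({delta:+.2f})")
```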
Evaluators
Built-in evaluators (always available):
| Evaluator | Type | Measures |
|---|---|---|
| `correctness` | llm | Does the output match the ground-truth answer? |
| `hallucination` | llm | Does the output contain claims not grounded in retrieved context? |
| `relevance` | llm | Does the output address what was asked? |
| `toxicity` | llm | Is the output safe and non-toxic? |
| `json_schema` | rule | Does the output match a provided JSON Schema? |
| `latency_p95` | rule | Is trace latency under a threshold? |
| `has_citation` | rule | Does the output include a citation pattern? |
Trigger programmatically
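A sketch of what triggering a run from code might look like. Note that the run-creation endpoint (`POST /v1/eval-runs`), its payload, and the status fields are assumptions for illustration; only `GET /v1/eval` and `GET /v1/eval-catalog` appear elsewhere on this page.

```python
# Minimal sketch: trigger a built-in evaluator over a dataset, then
# poll until the run completes. Endpoint, payload shape, and status
# values are assumptions, not documented API surface.
import os
import time
import requests

BASE_URL = "https://api.trulayer.ai"  # assumed API host
HEADERS = {"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"}

# Kick off a run: one evaluator over every trace in the dataset.
resp = requests.post(
    f"{BASE_URL}/v1/eval-runs",
    headers=HEADERS,
    json={"dataset_id": "ds_123", "evaluator": "hallucination"},
    timeout=30,
)
resp.raise_for_status()
run_id = resp.json()["id"]

# Poll for completion.
while True:
    run = requests.get(
        f"{BASE_URL}/v1/eval-runs/{run_id}", headers=HEADERS, timeout=30
    ).json()
    if run["status"] in ("completed", "failed"):
        break
    time.sleep(5)

print(run["status"], "mean score:", run.get("aggregate", {}).get("mean"))
```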
Trends
The Trends tab on a dataset or evaluator plots the aggregate score (mean / pass rate) over time, one line per run. Use it to spot regressions introduced by a prompt tweak, a model swap, or a framework upgrade. Click a point to jump to the underlying run.
Regression tests
Pin a dataset as a regression dataset under a project’s settings. When a deployment publishes a new model ID or prompt version (via the `deployment.created` webhook or the `/v1/deployments` API), TruLayer automatically runs every pinned dataset against the new configuration and posts the diff back to the triggering PR.
- New failures (regressions) block the deployment when `enforce` is on.
- Score deltas greater than the configured threshold (default 5%) are flagged.
- A full comparison run shows up under the deployment’s detail page in the Control dashboard.
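Since TruLayer runs pinned datasets automatically, nothing is required on your side to start the comparison; a minimal receiver for the `deployment.created` event is sketched below purely for illustration. The payload fields shown, and the choice of Flask, are assumptions.

```python
# Minimal sketch: a webhook receiver for deployment.created events.
# Payload field names (type, data.model_id, data.prompt_version) are
# assumptions about the event shape.
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/trulayer")
def on_deployment_created():
    event = request.get_json(force=True)
    if event.get("type") != "deployment.created":
        return "", 204
    dep = event.get("data", {})
    # Log the new configuration that the pinned regression runs target.
    print("new deployment:", dep.get("model_id"), dep.get("prompt_version"))
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)
```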