Documentation Index

Fetch the complete documentation index at: https://docs.trulayer.ai/llms.txt

Use this file to discover all available pages before exploring further.

The Evals page is where you configure evaluators, run them against traces or datasets, and track regression over time.

Sub-sections

  • Results — every eval result your app has triggered, newest first.
  • Datasets — curated trace collections used as fixed test inputs for regression. Open a dataset to see its runs and trigger new ones.
  • Evaluators — built-in + custom evaluators you can trigger.

Results

One row per eval result. Columns: trace ID, evaluator, metric, score, label, latency, timestamp. Click any row for the reasoning (LLM evals) or the rule output.

Filters

A filter bar above the results table narrows the list. All filter state lives in the URL, so any filtered view is shareable — bookmark it or paste the URL into a PR comment.
  • Project — scope results to a single project in the current organization.
  • Metric — restrict to one evaluator’s metric (options come from the public eval catalog at GET /v1/eval-catalog).
  • Score range — min and max inputs each accept a float between 0.0 and 1.0. Values outside that range are ignored, and if min > max the conflicting value is dropped.
  • Date range — last 1h / 24h / 7d / 30d presets that map to from / to timestamps.
  • Clear — resets every active filter and removes each param from the URL.
These map 1:1 to the query parameters on GET /v1/eval: project_id, metric, score_min, score_max, from, to.
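
The same filtered view can be pulled over the API. A minimal sketch built from the parameters above; the project ID and the ISO-8601 timestamp format are illustrative assumptions, so check the API reference for exact value formats:

# Low-scoring correctness results for one project over one week
curl "https://api.trulayer.ai/v1/eval?project_id=proj_123&metric=correctness&score_min=0.0&score_max=0.5&from=2025-05-01T00:00:00Z&to=2025-05-08T00:00:00Z" \
  -H "Authorization: Bearer $TRULAYER_API_KEY"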

Export

The Export button at the top-right of the results table downloads the current filtered list as CSV or JSONL. Clicking it opens a small menu with:
  • Include reasoning — when checked, the LLM-judge rationale column is included (truncated to 500 characters per row). Off by default because rationales can be long and multiline, which makes CSVs unwieldy.
  • Download as CSV — evals-YYYY-MM-DD.csv.
  • Download as JSONL — evals-YYYY-MM-DD.jsonl, one JSON object per line.
Row caps are per plan: 100 for Starter, 5,000 for Pro/Team. If the export hits the cap, a toast appears indicating how many rows were returned and suggesting either tighter filters or a plan upgrade.
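
Each JSONL line is one result object. The exact field names are not documented on this page, so the line below is a hypothetical example assembled from the columns listed under Results:

{"trace_id":"01j...","evaluator":"correctness","metric":"correctness","score":0.82,"label":"pass","latency_ms":1240,"timestamp":"2025-05-02T14:07:31Z"}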

Datasets

A dataset is a named set of trace IDs. Create one by:
  • Selecting traces from the Traces page and choosing Add to dataset
  • Filtering on the Feedback page and pushing highly-rated or highly-disputed traces into a dataset
  • Uploading a JSONL file via the dashboard or POST /v1/datasets
Every dataset has a stable ID — reference it from CI to run regression on every PR.
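
Dataset creation can also be scripted. A sketch of uploading a JSONL file to POST /v1/datasets; the multipart fields (name, file) are assumptions, as the exact request shape is not documented here:

# Hypothetical form fields: a dataset name plus a JSONL file of traces
curl https://api.trulayer.ai/v1/datasets \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -F "name=checkout-regression" \
  -F "file=@traces.jsonl"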

Runs

Runs live inside each dataset’s detail page — open Datasets, pick a dataset, and the runs panel lists every batch executed against it. Click Run evaluators on a dataset to trigger an evaluator over every trace in it. The resulting run is a row in that panel showing:
  • Dataset + evaluator + metric
  • Completion status and progress
  • Aggregate score (mean, median, histogram)
  • Pass/fail ratio if the metric is categorical
Runs can be compared pairwise — pick two runs over the same dataset and the dashboard diffs them by trace, highlighting regressions.

Export a run

The run detail page has an Export button in the header that downloads the full run — metadata, per-item scores, and any LLM-judge reasoning — as a single JSON file named eval-run-<id>.json. Useful for archiving, sharing with reviewers, or diffing across runs outside the dashboard.
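
Because the export bundles per-item scores, two runs over the same dataset can also be diffed outside the dashboard. A sketch using jq, assuming (this is an assumption) that the exported JSON contains an items array with trace_id and score fields, and using illustrative file names:

# Print traces whose score dropped between run A and run B
jq -n \
  --slurpfile a eval-run-A.json \
  --slurpfile b eval-run-B.json \
  '($a[0].items | map({key: .trace_id, value: .score}) | from_entries) as $old
   | $b[0].items[]
   | select($old[.trace_id] != null and .score < $old[.trace_id])
   | {trace_id: .trace_id, before: $old[.trace_id], after: .score}'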

Evaluators

Built-in evaluators (always available):
Evaluator       Type   Measures
correctness     llm    Does the output match the ground-truth answer?
hallucination   llm    Does the output contain claims not grounded in retrieved context?
relevance       llm    Does the output address what was asked?
toxicity        llm    Is the output safe and non-toxic?
json_schema     rule   Does the output match a provided JSON Schema?
latency_p95     rule   Is trace latency under a threshold?
has_citation    rule   Does the output include a citation pattern?
Custom evaluators can be created from the Evaluators tab — provide a rubric (LLM) or a Python function (rule).

Trigger programmatically

curl https://api.trulayer.ai/v1/eval \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"trace_id":"01j...","evaluator_type":"llm","metric_name":"correctness"}'
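
To discover valid metric_name values for the request above, list the public eval catalog referenced in the Filters section (since it is described as public, the Authorization header may be optional):

curl https://api.trulayer.ai/v1/eval-catalog \
  -H "Authorization: Bearer $TRULAYER_API_KEY"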
Or configure evaluators to run automatically on every ingested trace matching a filter — see Evaluators → Triggers.

Trends

The Trends tab on a dataset or evaluator plots the aggregate score (mean / pass rate) over time, one line per run. Use it to spot regressions introduced by a prompt tweak, a model swap, or a framework upgrade. Click a point to jump to the underlying run.

Regression tests

Pin a dataset as a regression dataset under a project’s settings. When a deployment publishes a new model id or prompt version (via the deployment.created webhook or the /v1/deployments API), TruLayer automatically runs every pinned dataset against the new configuration and posts the diff back to the triggering PR.
  • New failures (regressions) block the deployment when enforce is on.
  • Score deltas greater than the configured threshold (default 5%) are flagged.
  • A full comparison run shows up under the deployment’s detail page in the Control dashboard.
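
The deployment side can be exercised from CI as well. A sketch of publishing a new configuration through the /v1/deployments API so pinned regression datasets run against it; the field names (model_id, prompt_version) are assumptions based on the description above, and the real schema may differ:

# Hypothetical payload: announce the configuration that just shipped
curl https://api.trulayer.ai/v1/deployments \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_id":"my-model-v2","prompt_version":"2025-05-01"}'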