An eval is an automated judgement of a trace. TruLayer ships with 25 built-in evaluators across six tag categories, and also supports custom evaluators — LLM-based or rule-based — that you define.
Built-in catalog
| Name | Display name | Description | Tags |
|---|---|---|---|
| answer_relevance | Answer Relevance | Output directly addresses the user’s question or task | quality |
| bias | Bias Detection | Detects demographic, political, or cultural bias in outputs | safety, quality |
| citation_correctness | Citation Correctness | Citations point at real sources that support the cited claim | quality, retrieval |
| context_utilization | Context Utilization | Output meaningfully uses the relevant parts of provided context | quality, retrieval |
| cost_threshold | Cost Threshold | Flags traces where token cost exceeds a configured limit | cost, quality |
| factual_accuracy | Factual Accuracy | Verifies factual claims against widely known ground truth | quality |
| faithfulness | Faithfulness | Output is fully supported by the provided input context | quality, retrieval |
| function_call_correctness | Function Call | Correct tool chosen with valid arguments supplied | tool-use |
| groundedness | Groundedness | Output stays within the retrieved context; no free-floating generation | quality, retrieval |
| hallucination | Hallucination | Detects fabricated facts, invented citations, or unsupported claims | quality, safety |
| jailbreak | Jailbreak | Model did not comply with instructions to bypass its safety policies | safety |
| json_schema_conformance | JSON Schema | Output is valid JSON conforming to the expected schema | tool-use |
| language_conformance | Language Match | Output is written in the language requested by the input | quality |
| latency_threshold | Latency Threshold | Flags traces where wall-clock latency exceeds a configured limit | performance, quality |
| multi_turn_consistency | Multi-Turn Consistency | Conversation stays internally consistent across turns | quality |
| output_format_compliance | Format Compliance | Output conforms to the specified schema or structure | tool-use, quality |
| pii_leakage | PII Leakage | Detects personally identifiable information in outputs | safety |
| prompt_injection | Prompt Injection | Untrusted input did not override the system’s original instructions | safety |
| prompt_injection_resistance | Injection Resistance | Model successfully resisted embedded prompt injection attempts | safety |
| refusal_appropriateness | Refusal Check | Refusals are calibrated — neither over-refusing nor under-refusing | safety, quality |
| retrieval_precision | Retrieval Precision | Retrieved documents are relevant to the user’s query | retrieval |
| retrieval_recall | Retrieval Recall | Retrieved context contains enough information to fully answer the query | retrieval |
| sentiment_match | Sentiment Match | Output sentiment matches the expected sentiment label | quality |
| tool_choice_correctness | Tool Choice | Model selected the right tool for the task | tool-use |
| toxicity | Toxicity | Detects harmful, hateful, harassing, or abusive content | safety |
All 25 are browsable — with rubrics, defaults, and sample outputs — at /evals/catalog in the dashboard.
Evaluator types
| Type | How it works | When to use |
|---|---|---|
| llm | An LLM judges the trace against a rubric | Open-ended, nuanced metrics (correctness, tone) |
| rule | Deterministic code runs against the trace | Fast, cheap, reliable for structured checks (regex match, JSON schema validity, latency bounds) |
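To make the distinction concrete, here is a minimal sketch of what a rule-style check does. This is illustrative Python, not TruLayer's evaluator interface; the function name, trace shape, and required keys are all assumptions:

```python
import json

# Illustrative only: a deterministic "rule" check over a trace payload.
# The trace dict shape and the required keys are assumptions for this sketch.
def json_schema_like_check(trace: dict) -> dict:
    required_keys = {"answer", "sources"}  # hypothetical expected keys
    try:
        output = json.loads(trace["output"])
    except (KeyError, ValueError):
        return {"score": 0.0, "label": "fail"}
    if not isinstance(output, dict):
        return {"score": 0.0, "label": "fail"}
    missing = required_keys - output.keys()
    return {
        "score": 1.0 if not missing else 0.0,
        "label": "pass" if not missing else "fail",
    }
```

The same structure covers regex and latency checks: deterministic code in, a score and label out, with no judge model involved.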
Triggering an eval
Evals can be triggered three ways:
- Automatically on ingest — configure evaluators to run on every trace matching a filter (see Eval rules below)
- On demand via the API — POST /v1/eval with a trace_id
- In bulk from the dashboard — select a trace range and run an eval over them
For example, triggering an on-demand eval from the Python SDK:

```python
import trulayer

client = trulayer.get_client()

# Run an LLM-judged correctness eval against a single trace
result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="llm",
    metric_name="correctness",
)
print(result.score, result.label, result.reasoning)
```
Result fields
| Field | Type | Meaning |
|---|---|---|
| score | float | 0.0 to 1.0 for normalized metrics; otherwise metric-defined |
| label | string | pass / fail / warn, or metric-defined |
| reasoning | string | For LLM evals: the evaluator’s explanation |
| evaluator_version | string | Version of the evaluator that produced this result |
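A downstream quality gate might consume these fields as sketched below; the 0.8 cutoff is an arbitrary assumption, not a platform default:

```python
# Treat the eval result as a quality gate; field names follow the table above
result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="llm",
    metric_name="correctness",
)
if result.label == "fail" or result.score < 0.8:  # cutoff is an assumption, tune per metric
    # evaluator_version makes failures attributable when rubrics change
    print(f"[{result.evaluator_version}] low correctness: {result.reasoning}")
```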
Data flow for evals
LLM evals run the trace content through a judge model. Which model is called — and whose API key pays for the call — depends on your plan.
| Tier | Evals available | Judge model | Whose API key |
|---|---|---|---|
| Starter | Yes (2,500 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Pro | Yes (50,000 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Team, Business, Enterprise | Yes | The judge model you configure with your BYOK credentials | Your API key (BYOK) |
What this means at Starter and Pro tiers
When a Starter- or Pro-tier org triggers an LLM eval, TruLayer sends the trace input and output to the judge model using its own Anthropic or OpenAI API key. For the duration of the judge call, the trace content leaves TruLayer’s infrastructure under TruLayer’s provider account and is subject to the judge provider’s data-processing terms.
The dashboard surfaces this as a callout on the Evals page so the behaviour is visible before a run lands in production.
What this means at Team and above
Team, Business, and Enterprise plans support bring-your-own-key (BYOK). Configure your Anthropic or OpenAI credentials in Settings → Integrations and TruLayer will route every LLM eval — including the judge call — through your key. Trace content never leaves your provider account.
Rule-based evals (JSON-schema, regex, latency bounds, etc.) run entirely inside TruLayer’s infrastructure at every tier — no third-party judge model, no external API call.
Eval rules
An eval rule is a standing instruction that tells TruLayer to run a specific evaluator on every matching trace automatically. Create and manage eval rules via the dashboard Evals → Evaluators tab or the POST /v1/eval-rules API.
Eval rules are available on all plans. The Starter plan is limited to 3 eval rules per tenant; attempting to create a fourth returns HTTP 402 with code: "eval_rules_quota_exceeded". Upgrade to Pro to remove the cap.
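A sketch of guarding rule creation against that quota error, using the Python requests library. The endpoint and error code come from this page; the assumption is that the error body carries the code at its top level:

```python
import os
import requests

project_id = "YOUR_PROJECT_ID"  # replace with a real project uuid

resp = requests.post(
    "https://api.trulayer.ai/v1/eval-rules",
    headers={"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"},
    json={
        "project_id": project_id,
        "metric_name": "hallucination",
        "evaluator_type": "llm",
    },
)
if resp.status_code == 402 and resp.json().get("code") == "eval_rules_quota_exceeded":
    # Starter cap of 3 rules reached: disable an existing rule or upgrade to Pro
    print("Eval-rule quota exceeded on Starter plan")
else:
    resp.raise_for_status()
```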
EvalRule fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | — | Rule identifier (assigned on create) |
| project_id | uuid | Yes | Project this rule applies to |
| metric_name | string | Yes | Name of the evaluator to run (e.g. correctness, hallucination) |
| evaluator_type | llm \| rule | Yes | Whether the evaluator is LLM-based or deterministic |
| config | object | No | Evaluator-specific options (e.g. rubric, threshold) |
| enabled | boolean | No | Whether the rule fires on new traces. Default: true |
| model_tier | cheap \| premium | No | Which judge model tier to use. Default: cheap |
| min_score_drop | number | No | Per-rule minimum score drop required to fire a RemediationRegression alert. See below. |
| created_at | date-time | — | Timestamp of creation |
| updated_at | date-time | — | Timestamp of last update |
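Putting the fields together, a representative create payload might look like the following; the config contents are illustrative, since evaluator-specific options vary by metric:

```python
# Illustrative EvalRule create payload; the config keys are assumptions
rule = {
    "project_id": "YOUR_PROJECT_ID",
    "metric_name": "hallucination",
    "evaluator_type": "llm",
    "config": {"rubric": "Flag any claim not supported by the input"},  # assumed option
    "enabled": True,
    "model_tier": "premium",
    "min_score_drop": 0.05,
}
```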
min_score_drop — per-rule regression threshold
min_score_drop controls how sensitive the RemediationRegression detector is for this specific eval rule.
When the control loop re-evaluates a remediated trace, it compares the post-remediation score against the original score. If the score has dropped, a RemediationRegression failure event fires — unless the drop is smaller than min_score_drop, in which case the event is suppressed.
| Value | Behaviour |
|---|---|
| 0.0 (default) | Any score decrease fires the alert |
| 0.05 | Drops smaller than 5 points are ignored |
| 0.10 | Drops smaller than 10 points are ignored |
| null | Same as 0.0 — uses the platform default |
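The suppression rule reduces to a small predicate. A minimal sketch, assuming scores on the normalized 0.0 to 1.0 scale:

```python
def regression_fires(original: float, remediated: float, min_score_drop: float = 0.0) -> bool:
    """True when a RemediationRegression event should fire."""
    drop = original - remediated
    # No drop, no event; drops below the per-rule threshold are suppressed
    return drop > 0 and drop >= min_score_drop

regression_fires(0.82, 0.79)                       # True: any decrease fires at the default
regression_fires(0.82, 0.79, min_score_drop=0.05)  # False: 3-point noise is suppressed
```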
When to raise min_score_drop: LLM-based evaluators have inherent temperature noise — the same trace can score 0.82 one run and 0.79 the next with no real quality change. A 3-point variance triggering a regression alert creates noise in the failure pipeline. Setting min_score_drop between 0.05 and 0.10 on LLM-eval rules filters out this variance and keeps RemediationRegression alerts meaningful.
Rule-based evaluators (evaluator_type: rule) are deterministic and do not exhibit temperature noise, so min_score_drop: 0.0 (the default) is appropriate for them.
Setting the field:
```bash
# Set on create
curl -X POST https://api.trulayer.ai/v1/eval-rules \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "01936f...",
    "metric_name": "correctness",
    "evaluator_type": "llm",
    "min_score_drop": 0.05
  }'

# Update an existing rule
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": 0.05}'

# Clear back to platform default (alert on any drop)
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": null}'
```
Negative values are rejected with HTTP 422. The field is accepted on POST /v1/eval-rules (create) and PATCH /v1/eval-rules/:id (update), and is returned on all eval-rule responses.
Regression testing
Evals over a dataset (a fixed set of traces) let you detect regressions between app versions. See the Dashboard guide for running datasets.