An eval is an automated judgement of a trace. TruLayer ships with 25 built-in evaluators across six tag categories, and also supports custom evaluators — LLM-based or rule-based — that you define.
Built-in catalog
| Name | Display name | Description | Tags |
|---|---|---|---|
| answer_relevance | Answer Relevance | Output directly addresses the user’s question or task | quality |
| bias | Bias Detection | Detects demographic, political, or cultural bias in outputs | safety, quality |
| citation_correctness | Citation Correctness | Citations point at real sources that support the cited claim | quality, retrieval |
| context_utilization | Context Utilization | Output meaningfully uses the relevant parts of provided context | quality, retrieval |
| cost_threshold | Cost Threshold | Flags traces where token cost exceeds a configured limit | cost, quality |
| factual_accuracy | Factual Accuracy | Verifies factual claims against widely known ground truth | quality |
| faithfulness | Faithfulness | Output is fully supported by the provided input context | quality, retrieval |
| function_call_correctness | Function Call | Correct tool chosen with valid arguments supplied | tool-use |
| groundedness | Groundedness | Output stays within the retrieved context; no free-floating generation | quality, retrieval |
| hallucination | Hallucination | Detects fabricated facts, invented citations, or unsupported claims | quality, safety |
| jailbreak | Jailbreak | Model did not comply with instructions to bypass its safety policies | safety |
| json_schema_conformance | JSON Schema | Output is valid JSON conforming to the expected schema | tool-use |
| language_conformance | Language Match | Output is written in the language requested by the input | quality |
| latency_threshold | Latency Threshold | Flags traces where wall-clock latency exceeds a configured limit | performance, quality |
| multi_turn_consistency | Multi-Turn Consistency | Conversation stays internally consistent across turns | quality |
| output_format_compliance | Format Compliance | Output conforms to the specified schema or structure | tool-use, quality |
| pii_leakage | PII Leakage | Detects personally identifiable information in outputs | safety |
| prompt_injection | Prompt Injection | Untrusted input did not override the system’s original instructions | safety |
| prompt_injection_resistance | Injection Resistance | Model successfully resisted embedded prompt injection attempts | safety |
| refusal_appropriateness | Refusal Check | Refusals are calibrated — neither over-refusing nor under-refusing | safety, quality |
| retrieval_precision | Retrieval Precision | Retrieved documents are relevant to the user’s query | retrieval |
| retrieval_recall | Retrieval Recall | Retrieved context contains enough information to fully answer the query | retrieval |
| sentiment_match | Sentiment Match | Output sentiment matches the expected sentiment label | quality |
| tool_choice_correctness | Tool Choice | Model selected the right tool for the task | tool-use |
| toxicity | Toxicity | Detects harmful, hateful, harassing, or abusive content | safety |
All 25 are browsable — with rubrics, defaults, and sample outputs — at /evals/catalog in the dashboard.
Evaluator types
| Type | How it works | When to use |
|---|---|---|
| llm | An LLM judges the trace against a rubric | Open-ended, nuanced metrics (correctness, tone) |
| rule | Deterministic code runs against the trace | Fast, cheap, reliable for structured checks (regex match, JSON schema validity, latency bounds) |
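To make the distinction concrete, here is a minimal sketch of what a rule-style check does. This is illustrative Python, not TruLayer's evaluator interface; the function name, trace shape, and required keys are all assumptions:

```python
import json

# Illustrative only: a deterministic "rule" check over a trace payload.
# The trace dict shape and the required keys are assumptions for this sketch.
def json_schema_like_check(trace: dict) -> dict:
    required_keys = {"answer", "sources"}  # hypothetical expected keys
    try:
        output = json.loads(trace["output"])
    except (KeyError, ValueError):
        return {"score": 0.0, "label": "fail"}
    if not isinstance(output, dict):
        return {"score": 0.0, "label": "fail"}
    missing = required_keys - output.keys()
    return {
        "score": 1.0 if not missing else 0.0,
        "label": "pass" if not missing else "fail",
    }
```

The same structure covers regex and latency checks: deterministic code in, a score and label out, with no judge model involved.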
Triggering an eval
Evals can be triggered three ways:
- Automatically on ingest — configure evaluators to run on every trace matching a filter (see Eval rules below)
- On demand via the API — POST /v1/eval with a trace_id
- In bulk from the dashboard — select a trace range and run an eval over them
For example, triggering an on-demand eval from the Python SDK:

```python
import trulayer

client = trulayer.get_client()

# Run an LLM-judged correctness eval against a single trace
result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="llm",
    metric_name="correctness",
)
print(result.score, result.label, result.reasoning)
```
Result fields
| Field | Type | Meaning |
|---|---|---|
| score | float | 0.0 to 1.0 for normalized metrics; otherwise metric-defined |
| label | string | pass / fail / warn, or metric-defined |
| reasoning | string | For LLM evals: the evaluator’s explanation |
| evaluator_version | string | Version of the evaluator that produced this result |
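A downstream quality gate might consume these fields as sketched below; the 0.8 cutoff is an arbitrary assumption, not a platform default:

```python
# Treat the eval result as a quality gate; field names follow the table above
result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="llm",
    metric_name="correctness",
)
if result.label == "fail" or result.score < 0.8:  # cutoff is an assumption, tune per metric
    # evaluator_version makes failures attributable when rubrics change
    print(f"[{result.evaluator_version}] low correctness: {result.reasoning}")
```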
Data flow for evals
LLM evals run the trace content through a judge model. Which model is called — and whose API key pays for the call — depends on your plan.
| Tier | Evals available | Judge model | Whose API key |
|---|---|---|---|
| Starter | Yes (2,500 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Pro | Yes (50,000 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Team, Business, Enterprise | Yes | The judge model you configure with your BYOK credentials | Your API key (BYOK) |
What this means at Starter and Pro tiers
When a Starter- or Pro-tier org triggers an LLM eval, TruLayer sends the trace input and output to the judge model using its own Anthropic or OpenAI API key. For the duration of the judge call, the trace content leaves TruLayer’s infrastructure under TruLayer’s provider account and is subject to the judge provider’s data-processing terms.
The dashboard surfaces this as a callout on the Evals page so the behaviour is visible before a run lands in production.
What this means at Team and above
Team, Business, and Enterprise plans support bring-your-own-key (BYOK). Configure your Anthropic or OpenAI credentials in Settings → Integrations and TruLayer will route every LLM eval — including the judge call — through your key. Trace content never leaves your provider account.
Rule-based evals (JSON-schema, regex, latency bounds, etc.) run entirely inside TruLayer’s infrastructure at every tier — no third-party judge model, no external API call.
Eval rules
An eval rule is a standing instruction that tells TruLayer to run a specific evaluator on every matching trace automatically. Create and manage eval rules via the dashboard Evals → Evaluators tab or the POST /v1/eval-rules API.
Eval rules are available on all plans. The Starter plan is limited to 3 eval rules per tenant; attempting to create a fourth returns HTTP 402 with code: "eval_rules_quota_exceeded". Upgrade to Pro to remove the cap.
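A sketch of guarding rule creation against that quota error, using the Python requests library. The endpoint and error code come from this page; the assumption is that the error body carries the code at its top level:

```python
import os
import requests

project_id = "YOUR_PROJECT_ID"  # replace with a real project uuid

resp = requests.post(
    "https://api.trulayer.ai/v1/eval-rules",
    headers={"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"},
    json={
        "project_id": project_id,
        "metric_name": "hallucination",
        "evaluator_type": "llm",
    },
)
if resp.status_code == 402 and resp.json().get("code") == "eval_rules_quota_exceeded":
    # Starter cap of 3 rules reached: disable an existing rule or upgrade to Pro
    print("Eval-rule quota exceeded on Starter plan")
else:
    resp.raise_for_status()
```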
EvalRule fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | — | Rule identifier (assigned on create) |
| project_id | uuid | Yes | Project this rule applies to |
| metric_name | string | Yes | Name of the evaluator to run (e.g. correctness, hallucination) |
| evaluator_type | llm \| rule | Yes | Whether the evaluator is LLM-based or deterministic |
| config | object | No | Evaluator-specific options (e.g. rubric, threshold) |
| enabled | boolean | No | Whether the rule fires on new traces. Default: true |
| model_tier | cheap \| premium | No | Which judge model tier to use. Default: cheap |
| min_score_drop | number | No | Per-rule minimum score drop required to fire a RemediationRegression alert. See below. |
| created_at | date-time | — | Timestamp of creation |
| updated_at | date-time | — | Timestamp of last update |
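Putting the fields together, a representative create payload might look like the following; the config contents are illustrative, since evaluator-specific options vary by metric:

```python
# Illustrative EvalRule create payload; the config keys are assumptions
rule = {
    "project_id": "YOUR_PROJECT_ID",
    "metric_name": "hallucination",
    "evaluator_type": "llm",
    "config": {"rubric": "Flag any claim not supported by the input"},  # assumed option
    "enabled": True,
    "model_tier": "premium",
    "min_score_drop": 0.05,
}
```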
min_score_drop — per-rule regression threshold
min_score_drop controls how sensitive the RemediationRegression detector is for this specific eval rule.
When the control loop re-evaluates a remediated trace, it compares the post-remediation score against the original score. If the score has dropped, a RemediationRegression failure event fires — unless the drop is smaller than min_score_drop, in which case the event is suppressed.
| Value | Behaviour |
|---|---|
| 0.0 (default) | Any score decrease fires the alert |
| 0.05 | Drops smaller than 5 points are ignored |
| 0.10 | Drops smaller than 10 points are ignored |
| null | Same as 0.0 — uses the platform default |
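The suppression rule reduces to a small predicate. A minimal sketch, assuming scores on the normalized 0.0 to 1.0 scale:

```python
def regression_fires(original: float, remediated: float, min_score_drop: float = 0.0) -> bool:
    """True when a RemediationRegression event should fire."""
    drop = original - remediated
    # No drop, no event; drops below the per-rule threshold are suppressed
    return drop > 0 and drop >= min_score_drop

regression_fires(0.82, 0.79)                       # True: any decrease fires at the default
regression_fires(0.82, 0.79, min_score_drop=0.05)  # False: 3-point noise is suppressed
```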
When to raise min_score_drop: LLM-based evaluators have inherent temperature noise — the same trace can score 0.82 one run and 0.79 the next with no real quality change. A 3-point variance triggering a regression alert creates noise in the failure pipeline. Setting min_score_drop between 0.05 and 0.10 on LLM-eval rules filters out this variance and keeps RemediationRegression alerts meaningful.
Rule-based evaluators (evaluator_type: rule) are deterministic and do not exhibit temperature noise, so min_score_drop: 0.0 (the default) is appropriate for them.
Setting the field:
```bash
# Set on create
curl -X POST https://api.trulayer.ai/v1/eval-rules \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "01936f...",
    "metric_name": "correctness",
    "evaluator_type": "llm",
    "min_score_drop": 0.05
  }'

# Update an existing rule
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": 0.05}'

# Clear back to platform default (alert on any drop)
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": null}'
```
Negative values are rejected with HTTP 422. The field is accepted on POST /v1/eval-rules (create) and PATCH /v1/eval-rules/:id (update), and is returned on all eval-rule responses.
Regression testing
Evals over a dataset (a fixed set of traces) let you detect regressions between app versions. See the Dashboard guide for running datasets.