
An eval is an automated judgement of a trace. TruLayer ships with 25 built-in evaluators across six tag categories, and also supports custom evaluators — LLM-based or rule-based — that you define.

Built-in catalog

| Name | Display name | Description | Tags |
|---|---|---|---|
| answer_relevance | Answer Relevance | Output directly addresses the user’s question or task | quality |
| bias | Bias Detection | Detects demographic, political, or cultural bias in outputs | safety, quality |
| citation_correctness | Citation Correctness | Citations point at real sources that support the cited claim | quality, retrieval |
| context_utilization | Context Utilization | Output meaningfully uses the relevant parts of provided context | quality, retrieval |
| cost_threshold | Cost Threshold | Flags traces where token cost exceeds a configured limit | cost, quality |
| factual_accuracy | Factual Accuracy | Verifies factual claims against widely known ground truth | quality |
| faithfulness | Faithfulness | Output is fully supported by the provided input context | quality, retrieval |
| function_call_correctness | Function Call | Correct tool chosen with valid arguments supplied | tool-use |
| groundedness | Groundedness | Output stays within the retrieved context; no free-floating generation | quality, retrieval |
| hallucination | Hallucination | Detects fabricated facts, invented citations, or unsupported claims | quality, safety |
| jailbreak | Jailbreak | Model did not comply with instructions to bypass its safety policies | safety |
| json_schema_conformance | JSON Schema | Output is valid JSON conforming to the expected schema | tool-use |
| language_conformance | Language Match | Output is written in the language requested by the input | quality |
| latency_threshold | Latency Threshold | Flags traces where wall-clock latency exceeds a configured limit | performance, quality |
| multi_turn_consistency | Multi-Turn Consistency | Conversation stays internally consistent across turns | quality |
| output_format_compliance | Format Compliance | Output conforms to the specified schema or structure | tool-use, quality |
| pii_leakage | PII Leakage | Detects personally identifiable information in outputs | safety |
| prompt_injection | Prompt Injection | Untrusted input did not override the system’s original instructions | safety |
| prompt_injection_resistance | Injection Resistance | Model successfully resisted embedded prompt injection attempts | safety |
| refusal_appropriateness | Refusal Check | Refusals are calibrated — neither over-refusing nor under-refusing | safety, quality |
| retrieval_precision | Retrieval Precision | Retrieved documents are relevant to the user’s query | retrieval |
| retrieval_recall | Retrieval Recall | Retrieved context contains enough information to fully answer the query | retrieval |
| sentiment_match | Sentiment Match | Output sentiment matches the expected sentiment label | quality |
| tool_choice_correctness | Tool Choice | Model selected the right tool for the task | tool-use |
| toxicity | Toxicity | Detects harmful, hateful, harassing, or abusive content | safety |
All 25 are browsable — with rubrics, defaults, and sample outputs — at /evals/catalog in the dashboard.

Evaluator types

| Type | How it works | When to use |
|---|---|---|
| llm | An LLM judges the trace against a rubric | Open-ended, nuanced metrics (correctness, tone) |
| rule | Deterministic code runs against the trace | Fast, cheap, reliable for structured checks (regex match, JSON schema validity, latency bounds) |
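
A minimal sketch of how the choice plays out in practice, assuming the client.run_eval call shown under "Triggering an eval" below accepts either evaluator_type, and using metric names from the catalog above:

import trulayer

client = trulayer.get_client()

# LLM-judged metric: a judge model scores the trace against a rubric
llm_result = client.run_eval(
    trace_id=trace_id,  # placeholder: an already-ingested trace
    evaluator_type="llm",
    metric_name="faithfulness",
)

# Rule-based metric: deterministic code, no judge model call
rule_result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="rule",
    metric_name="json_schema_conformance",
)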

Triggering an eval

Evals can be triggered three ways:
  1. Automatically on ingest — configure evaluators to run on every trace matching a filter
  2. On demand via the API — POST /v1/eval with a trace_id
  3. In bulk from the dashboard — select a trace range and run an eval over them
The on-demand path (option 2) from the Python SDK:

import trulayer

client = trulayer.get_client()

# Run one evaluator against an already-ingested trace
result = client.run_eval(
    trace_id=trace_id,  # ID of the trace to judge
    evaluator_type="llm",
    metric_name="correctness",
)
print(result.score, result.label, result.reasoning)

Result fields

| Field | Type | Meaning |
|---|---|---|
| score | float | 0.0 to 1.0 for normalized metrics; otherwise metric-defined |
| label | string | pass / fail / warn, or metric-defined |
| reasoning | string | For LLM evals: the evaluator’s explanation |
| evaluator_version | string | Version of the evaluator that produced this result |
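
A minimal sketch of acting on these fields, assuming the result object returned by client.run_eval above (the 0.5 cutoff is an arbitrary example, not a platform default):

result = client.run_eval(
    trace_id=trace_id,
    evaluator_type="llm",
    metric_name="correctness",
)

# Gate on the normalized score and label; reasoning is only set for LLM evals
if result.label == "fail" or result.score < 0.5:
    print(f"eval {result.evaluator_version} flagged trace: {result.reasoning}")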

Data flow for evals

LLM evals run the trace content through a judge model. Which model is called — and whose API key pays for the call — depends on your plan.
| Tier | Evals available | Judge model | Whose API key |
|---|---|---|---|
| Starter | Yes (2,500 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Pro | Yes (50,000 / month) | claude-3-5-haiku (Anthropic) or gpt-4o-mini (OpenAI), chosen by the evaluator | TruLayer’s platform key |
| Team, Business, Enterprise | Yes | The judge model you configure with your BYOK credentials | Your API key (BYOK) |

What this means at Starter and Pro tiers

When a Starter- or Pro-tier org triggers an LLM eval, TruLayer sends the trace input and output to the judge model using its own Anthropic or OpenAI API key. For the duration of the judge call, the trace content leaves TruLayer’s infrastructure under TruLayer’s platform-key account and is subject to the judge provider’s data-processing terms. The dashboard surfaces this as a callout on the Evals page so the behaviour is visible before a run lands in production.

What this means at Team and above

Team and Enterprise plans support bring-your-own-key (BYOK). Configure your Anthropic or OpenAI credentials in Settings → Integrations and TruLayer will route every LLM eval — including the judge call — through your key. Trace content never leaves your provider account. Rule-based evals (JSON-schema, regex, latency bounds, etc.) run entirely inside TruLayer’s infrastructure at every tier — no third-party judge model, no external API call.

Eval rules

An eval rule is a standing instruction that tells TruLayer to run a specific evaluator on every matching trace automatically. Create and manage eval rules via the dashboard Evals → Evaluators tab or the POST /v1/eval-rules API. Eval rules are available on all plans, but Starter tenants are limited to 3 rules; attempting to create a fourth returns HTTP 402 with code: "eval_rules_quota_exceeded". Upgrade to Pro to remove the cap. A sketch of handling that error follows.
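
Since no SDK helper for eval rules is documented here, this sketch creates a rule over raw HTTP and handles the Starter quota error; it assumes the error code appears as a top-level code field in the response body:

import os
import requests

resp = requests.post(
    "https://api.trulayer.ai/v1/eval-rules",
    headers={"Authorization": f"Bearer {os.environ['TRULAYER_API_KEY']}"},
    json={
        "project_id": "01936f...",  # placeholder, as in the curl examples below
        "metric_name": "hallucination",
        "evaluator_type": "llm",
    },
)
if resp.status_code == 402 and resp.json().get("code") == "eval_rules_quota_exceeded":
    # Fourth rule on a Starter tenant: delete an existing rule or upgrade to Pro
    print("Eval-rule quota reached")
else:
    resp.raise_for_status()
    rule = resp.json()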

EvalRule fields

| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | | Rule identifier (assigned on create) |
| project_id | uuid | Yes | Project this rule applies to |
| metric_name | string | Yes | Name of the evaluator to run (e.g. correctness, hallucination) |
| evaluator_type | llm or rule | Yes | Whether the evaluator is LLM-based or deterministic |
| config | object | No | Evaluator-specific options (e.g. rubric, threshold) |
| enabled | boolean | No | Whether the rule fires on new traces. Default: true |
| model_tier | cheap or premium | No | Which judge model tier to use. Default: cheap |
| min_score_drop | number | No | Per-rule minimum score drop required to fire a RemediationRegression alert. See below. |
| created_at | date-time | | Timestamp of creation |
| updated_at | date-time | | Timestamp of last update |
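
Putting the fields together, an illustrative rule object as a Python dict; the values are invented for illustration, and only the field names and types come from the table above:

rule = {
    "id": "01936f...",                 # assigned on create
    "project_id": "01936f...",
    "metric_name": "hallucination",
    "evaluator_type": "llm",
    "config": {},                      # evaluator-specific options, e.g. rubric or threshold
    "enabled": True,
    "model_tier": "cheap",
    "min_score_drop": 0.05,
    "created_at": "2025-01-01T00:00:00Z",
    "updated_at": "2025-01-01T00:00:00Z",
}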

min_score_drop — per-rule regression threshold

min_score_drop controls how sensitive the RemediationRegression detector is for this specific eval rule. When the control loop re-evaluates a remediated trace, it compares the post-remediation score against the original score. If the score has dropped, a RemediationRegression failure event fires — unless the drop is smaller than min_score_drop, in which case the event is suppressed.
| Value | Behaviour |
|---|---|
| 0.0 (default) | Any score decrease fires the alert |
| 0.05 | Drops smaller than 0.05 are ignored |
| 0.10 | Drops smaller than 0.10 are ignored |
| null | Same as 0.0 (uses the platform default) |
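
The comparison the table describes, as a small illustrative function (TruLayer’s actual implementation is not shown in these docs):

def regression_fires(original: float, post: float, min_score_drop: float = 0.0) -> bool:
    # A RemediationRegression event fires only on a genuine decrease
    # that is not smaller than the configured threshold.
    drop = original - post
    return drop > 0 and drop >= min_score_drop

# 0.84 -> 0.80 is a 0.04 drop: suppressed at 0.05, fires at the 0.0 default
assert not regression_fires(0.84, 0.80, min_score_drop=0.05)
assert regression_fires(0.84, 0.80)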
When to raise min_score_drop: LLM-based evaluators have inherent temperature noise: the same trace can score 0.82 on one run and 0.79 on the next with no real quality change. A 0.03 swing triggering a regression alert creates noise in the failure pipeline. Setting min_score_drop to 0.05 or 0.10 on LLM-eval rules filters out this variance and keeps RemediationRegression alerts meaningful. Rule-based evaluators (evaluator_type: rule) are deterministic and do not exhibit temperature noise, so min_score_drop: 0.0 (the default) is appropriate for them.

Setting the field:
# Set on create
curl -X POST https://api.trulayer.ai/v1/eval-rules \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "01936f...",
    "metric_name": "correctness",
    "evaluator_type": "llm",
    "min_score_drop": 0.05
  }'

# Update an existing rule
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": 0.05}'

# Clear back to platform default (alert on any drop)
curl -X PATCH https://api.trulayer.ai/v1/eval-rules/01936f... \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_score_drop": null}'
Negative values are rejected with HTTP 422. The field is accepted on POST /v1/eval-rules (create) and PATCH /v1/eval-rules/:id (update), and is returned on all eval-rule responses.

Regression testing

Evals over a dataset (a fixed set of traces) let you detect regressions between app versions. See the Dashboard guide for running datasets.