Claude Prompt for Evals & Observability
Design a pairwise + rubric LLM-as-judge prompt for long-doc QA with bias mitigation, calibration, and reproducibility.
You are an evals lead. Build an LLM-as-judge system that is reliable enough to be the PRIMARY signal for shipping decisions on long-doc QA.
## Why LLM-as-Judge
Human eval is gold but expensive and slow. Automated metrics (BLEU, ROUGE, exact-match) miss the point for generative tasks. LLM-as-judge fills the gap when designed carefully. Designed carelessly, it introduces subtle biases that invalidate shipping decisions.
## Judge Model: GPT-4.1 rubric scorer
- Typically use a stronger model than the one being evaluated. Here: GPT-4.1 rubric scorer.
- Do not judge a model with itself (self-preference bias).
- Fix the judge version via pinned API version or checkpoint hash for reproducibility.
## Rubric Design
### Dimensions
For long-doc QA, evaluate along six dimensions: five core dimensions plus one task-specific dimension. Each dimension should be:
- Independently meaningful (a response could score high on one and low on another)
- Observable from text alone (not requiring external facts)
- Described with concrete anchor examples for each score level
Dimensions for long-doc QA:
1. **Correctness** — factually accurate given the input
2. **Completeness** — addresses all parts of the prompt
3. **Format** — follows requested format
4. **Clarity** — readable, well-organized
5. **Appropriate Caution** — expresses uncertainty where warranted; refuses when required
6. **[Additional task-specific dimension for long-doc QA]**
### Scale
Use a 5-point Likert scale (1-5) per dimension. Avoid 3-point scales (too coarse) and 7+ point scales (too noisy unless raters are given extensive anchoring).
Anchor each score with 1-2 example responses of that quality level (concrete, for long-doc QA).
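One way to keep the rubric versionable is to hold it as data and render it into the prompt. A minimal sketch; the dimension layout matches the list above, but the anchor texts and field names are illustrative placeholders:
```python
# A rubric kept as data: easy to hash for versioning and to render into
# the judge prompt. Anchor texts here are illustrative, not prescribed.
RUBRIC = {
    "version": "long-doc-qa-v1",
    "dimensions": {
        "correctness": {
            "description": "Factually accurate given the input document.",
            "anchors": {
                1: "Central claim contradicts the source document.",
                3: "Mostly accurate; one minor unsupported detail.",
                5: "Every claim is traceable to the source document.",
            },
        },
        # completeness, format, clarity, caution, and the task-specific
        # dimension are defined the same way.
    },
}

def render_anchors(rubric: dict) -> str:
    """Render dimensions and anchors into the {dimension_anchors} slot."""
    lines = []
    for name, dim in rubric["dimensions"].items():
        lines.append(f"{name}: {dim['description']}")
        for score, anchor in sorted(dim["anchors"].items()):
            lines.append(f"  {score}: {anchor}")
    return "\n".join(lines)
```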
## Judge Prompt
```
You are an expert evaluator for long-doc QA model outputs.
TASK PROMPT: {task_prompt}
MODEL RESPONSE: {model_response}
[REFERENCE ANSWER (optional): {reference}]
Evaluate the MODEL RESPONSE on each dimension. For each, first explain your reasoning in 1-2 sentences, then give a score 1-5.
Dimensions and anchors:
{dimension_anchors}
Output JSON:
{
  "reasoning": "chain-of-thought across all dimensions",
  "scores": {
    "correctness": { "score": 1-5, "evidence": "..." },
    "completeness": { "score": 1-5, "evidence": "..." },
    "format": { "score": 1-5, "evidence": "..." },
    "clarity": { "score": 1-5, "evidence": "..." },
    "caution": { "score": 1-5, "evidence": "..." }
  },
  "overall_pass": true/false,
  "issues": ["specific problem 1", "specific problem 2"]
}
```
Key design choices:
- **Reasoning first, scores last.** Improves calibration.
- **Evidence per score.** Forces grounding, enables audit.
- **Randomize example order** when comparing two responses.
- **Temperature = 0.** Reproducibility.
- **No model-identifying hints** in the prompt (e.g., "this was generated by GPT" biases the judge).
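A minimal harness sketch putting these choices together, assuming the official `openai` Python SDK; the model name, seed, and condensed prompt text are assumptions to adapt, and the pinned model must match whatever you calibrate against:
```python
import json

from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()
JUDGE_MODEL = "gpt-4.1"  # pin the exact snapshot you calibrated against

def judge(task_prompt: str, model_response: str, dimension_anchors: str) -> dict:
    """Score one response with the rubric judge (prompt condensed for brevity)."""
    prompt = (
        "You are an expert evaluator for long-doc QA model outputs.\n"
        f"TASK PROMPT: {task_prompt}\n"
        f"MODEL RESPONSE: {model_response}\n"
        "Evaluate the MODEL RESPONSE on each dimension. For each, first explain "
        "your reasoning in 1-2 sentences, then give a score 1-5.\n"
        f"Dimensions and anchors:\n{dimension_anchors}\n"
        "Output JSON with keys: reasoning, scores, overall_pass, issues."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,                            # reproducibility
        seed=42,                                  # best-effort determinism
        response_format={"type": "json_object"},  # force parseable output
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```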
## Pairwise Variant
For comparing two models (A vs B), use pairwise preference:
```
Compare RESPONSE A and RESPONSE B. You MUST pick one:
- A_better
- B_better
- tie (only if genuinely indistinguishable)
- both_bad (only if both fail basic quality)
Explain your reasoning. Output JSON.
```
**Position bias mitigation:** for each pair, run the judge TWICE: once with (A, B), once with (B, A). Accept a verdict only if it is consistent after the swap; if the two runs disagree, record a tie.
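A sketch of the swap-and-require-consistency rule; `judge_pair()` is a hypothetical wrapper around the pairwise prompt above that returns one of the four verdicts for the argument order it is given:
```python
def pairwise_verdict(prompt: str, resp_a: str, resp_b: str) -> str:
    """Accept a pairwise verdict only if it survives a position swap.
    judge_pair() is a hypothetical wrapper that returns 'A_better',
    'B_better', 'tie', or 'both_bad' for its argument order."""
    verdict_ab = judge_pair(prompt, resp_a, resp_b)  # (A, B) order
    verdict_ba = judge_pair(prompt, resp_b, resp_a)  # (B, A) order
    swap = {"A_better": "B_better", "B_better": "A_better"}
    if verdict_ab == swap.get(verdict_ba, verdict_ba):
        return verdict_ab  # consistent under swap
    return "tie"           # runs disagreed after swapping positions
```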
## Biases to Mitigate
| Bias | Symptom | Mitigation |
|---|---|---|
| Position | Judge favors first/last response | Swap + require consistency |
| Length | Judge favors longer response | Report length-matched win rate too |
| Style | Judge favors markdown / bullet lists | Judge prompt explicitly says style is secondary to substance |
| Self-preference | Model rates itself highly | Never judge with same model |
| Authority | Judge favors confident wording | Anchor examples show confident-but-wrong as low score |
| Sycophancy | Judge agrees with first opinion | Require evidence per score |
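For the length row, a sketch of a length-matched win rate: restrict the computation to pairs whose responses are close in length (the 1.2 ratio is an assumption) and exclude undecided verdicts from the denominator:
```python
def length_matched_win_rate(pairs: list[tuple[str, int, int]],
                            max_ratio: float = 1.2) -> float:
    """Win rate for A over pairs whose response lengths are within
    max_ratio of each other. Each pair is (verdict, len_a, len_b);
    ties and both_bad are excluded from the denominator."""
    matched = [
        v for v, len_a, len_b in pairs
        if max(len_a, len_b) <= max_ratio * max(min(len_a, len_b), 1)
    ]
    decided = [v for v in matched if v in ("A_better", "B_better")]
    return decided.count("A_better") / len(decided) if decided else float("nan")
```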
## Calibration
Before trusting the judge, calibrate against human labels:
1. Collect 250 task outputs with human ratings (ideally 3+ raters per item)
2. Run the judge over the same outputs
3. Compute:
- Pearson correlation per dimension (target ≥ 0.7)
- Cohen's kappa for pairwise verdicts vs human consensus (target ≥ 0.6)
- Confusion matrix for discrete scores
4. If correlation is low, iterate on the rubric — usually the fix is sharper anchor examples, not a stronger judge model
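A calibration sketch using `scipy` and `scikit-learn`, assuming human and judge scores are aligned per item (function and variable names are illustrative):
```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def calibrate(human: dict, judge: dict) -> None:
    """human/judge map dimension name -> per-item scores on the same items
    (human scores are the consensus of 3+ raters)."""
    for dim in human:
        r, _ = pearsonr(human[dim], judge[dim])
        flag = "OK" if r >= 0.7 else "ITERATE ON RUBRIC"
        print(f"{dim}: Pearson r = {r:.2f} [{flag}]")
        print(confusion_matrix(human[dim], judge[dim], labels=[1, 2, 3, 4, 5]))

def pairwise_agreement(human_verdicts: list, judge_verdicts: list) -> float:
    """Cohen's kappa for pairwise verdicts vs human consensus (target >= 0.6)."""
    return cohen_kappa_score(human_verdicts, judge_verdicts)
```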
## Cost & Latency
- GPT-4.1 rubric scorer avg: 3s per eval, $0.03 per eval
- Eval set of 500 examples = $15 per run
- Budget: ~50 runs per week → $750/week
## Integration
### Offline Eval
- CI job runs on every merge to main
- Fails the build if any key metric drops > 2 pts vs last green
- Report posted as PR comment with before/after table
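A sketch of the CI gate, assuming each run writes aggregate per-metric scores to a JSON file (the paths and the 2-point threshold are assumptions):
```python
import json
import sys

THRESHOLD = 2.0  # allowed regression in points vs the last green run

def gate(current_path: str, baseline_path: str) -> None:
    """Fail the build (nonzero exit) if any metric regresses past THRESHOLD."""
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    failures = [
        f"{m}: {baseline[m]:.1f} -> {current[m]:.1f}"
        for m in baseline
        if baseline[m] - current.get(m, 0.0) > THRESHOLD
    ]
    if failures:
        print("Eval gate FAILED:\n" + "\n".join(failures))
        sys.exit(1)
    print("Eval gate passed.")

if __name__ == "__main__":
    gate(sys.argv[1], sys.argv[2])
```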
### Online Eval
- Shadow-score 5% of production traffic with the judge
- Aggregate by user segment, prompt type, time window
- Anomaly alerts when aggregate score drops by 2σ
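A sketch of the sampling and alert logic; `enqueue_judge()` is a hypothetical async queue so judging never blocks the live request:
```python
import random

SHADOW_RATE = 0.05  # fraction of production traffic to shadow-score

def maybe_shadow_score(trace: dict) -> None:
    """Sample traces for asynchronous judging; never block the live request."""
    if random.random() < SHADOW_RATE:
        enqueue_judge(trace)  # hypothetical queue; a worker runs the judge

def is_anomalous(latest: float, history: list[float]) -> bool:
    """True if the latest window's aggregate score drops more than 2σ
    below the historical mean."""
    mean = sum(history) / len(history)
    std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5
    return latest < mean - 2 * std
```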
## Reproducibility
Every eval run records:
- Judge model + version
- Rubric version hash
- Prompt version
- Eval set version hash
- Random seeds
- Judge latency + cost per item
Results stored with `eval_run_id` in Humanloop for drill-down.
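One way to make the record concrete: hash the versioned artifacts and keep everything in a single record keyed by `eval_run_id` (field names are illustrative):
```python
import hashlib
import json
from dataclasses import dataclass

def content_hash(obj) -> str:
    """Stable short hash of a JSON-serializable artifact (rubric, eval set)."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

@dataclass
class EvalRunRecord:
    eval_run_id: str
    judge_model: str            # pinned model + API version
    rubric_hash: str            # content_hash(RUBRIC)
    prompt_version: str
    eval_set_hash: str
    seed: int
    latency_s_per_item: float
    cost_usd_per_item: float
```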
## Golden Set Curation
Human-labeled golden set of 1000 examples, stratified by:
- Task subtype
- Difficulty (easy / medium / hard)
- Source (synthetic / real user / adversarial)
Refresh monthly to prevent overfitting. Quarantine 10% as a frozen, never-seen hold-out used only for final pre-release validation.
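A stratified-quarantine sketch using `scikit-learn`; the field names on each example are assumptions:
```python
from sklearn.model_selection import train_test_split

def split_golden_set(examples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Quarantine 10% as a frozen hold-out, stratified jointly by subtype,
    difficulty, and source so the hold-out mirrors the working set.
    Strata with a single example will raise; merge rare strata first."""
    strata = [f"{e['subtype']}|{e['difficulty']}|{e['source']}" for e in examples]
    working, frozen = train_test_split(
        examples, test_size=0.10, stratify=strata, random_state=0
    )
    return working, frozen
```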
## Deliverables
1. Rubric doc with anchor examples per dimension × score
2. Judge prompt template
3. Calibration notebook showing human-judge correlation
4. Eval harness: takes (model_fn, eval_set) → results JSON
5. CI integration snippet
6. Humanloop dashboard: aggregate scores over time, drill by dimension, pairwise win-rate matrix
Organize your output using a clear framework with labeled sections. Each section should build on the previous one.

Replace the bracketed placeholders with your own context before running the prompt:
- `[Additional task-specific dimension for long-doc QA]`: the extra dimension your long-doc QA task needs (e.g., faithfulness to cited passages).
- `[REFERENCE ANSWER (optional): {reference}]`: a reference answer, if one is available; omit the line otherwise.
- `["specific problem 1", "specific problem 2"]`: placeholder entries in the output JSON; the judge replaces them with the concrete issues it finds.