Claude Prompt for Evals & Observability
Design a pairwise + rubric LLM-as-judge prompt for long-doc QA with bias mitigation, calibration, and reproducibility.
You are an evals lead. Build an LLM-as-judge system that is reliable enough to be the PRIMARY signal for shipping decisions on long-doc QA.
## Why LLM-as-Judge
Human eval is gold but expensive and slow. Automated metrics (BLEU, ROUGE, exact-match) miss the point for generative tasks. LLM-as-judge fills the gap when designed carefully. Designed carelessly, it introduces subtle biases that invalidate shipping decisions.
## Judge Model: GPT-4.1 rubric scorer
- Typically use a stronger model than the one being evaluated. Here: GPT-4.1 rubric scorer.
- Do not judge a model with itself (self-preference bias).
- Fix the judge version via pinned API version or checkpoint hash for reproducibility.
## Rubric Design
### Dimensions
For long-doc QA, evaluate along six dimensions: five core dimensions plus one task-specific dimension. Each dimension should be:
- Independently meaningful (a response could score high on one and low on another)
- Observable from text alone (not requiring external facts)
- Described with concrete anchor examples for each score level
Dimensions for long-doc QA:
1. **Correctness** — factually accurate given the input
2. **Completeness** — addresses all parts of the prompt
3. **Format** — follows requested format
4. **Clarity** — readable, well-organized
5. **Appropriate Caution** — expresses uncertainty where warranted; refuses when required
6. **[Additional task-specific dimension for long-doc QA]**
### Scale
Use a 5-point Likert scale (1-5) per dimension. Avoid 3-point scales (too coarse) and 7+ point scales (too noisy unless raters are given extensive anchoring).
Anchor each score with 1-2 example responses of that quality level (concrete, for long-doc QA).
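One way to keep the rubric versionable is to hold it as data and render it into the prompt. A minimal sketch; the dimension layout matches the list above, but the anchor texts and field names are illustrative placeholders:
```python
# A rubric kept as data: easy to hash for versioning and to render into
# the judge prompt. Anchor texts here are illustrative, not prescribed.
RUBRIC = {
    "version": "long-doc-qa-v1",
    "dimensions": {
        "correctness": {
            "description": "Factually accurate given the input document.",
            "anchors": {
                1: "Central claim contradicts the source document.",
                3: "Mostly accurate; one minor unsupported detail.",
                5: "Every claim is traceable to the source document.",
            },
        },
        # completeness, format, clarity, caution, and the task-specific
        # dimension are defined the same way.
    },
}

def render_anchors(rubric: dict) -> str:
    """Render dimensions and anchors into the {dimension_anchors} slot."""
    lines = []
    for name, dim in rubric["dimensions"].items():
        lines.append(f"{name}: {dim['description']}")
        for score, anchor in sorted(dim["anchors"].items()):
            lines.append(f"  {score}: {anchor}")
    return "\n".join(lines)
```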
## Judge Prompt
```
You are an expert evaluator for long-doc QA model outputs.
TASK PROMPT: {task_prompt}
MODEL RESPONSE: {model_response}
[REFERENCE ANSWER (optional): {reference}]
Evaluate the MODEL RESPONSE on each dimension. For each, first explain your reasoning in 1-2 sentences, then give a score 1-5.
Dimensions and anchors:
{dimension_anchors}
Output JSON:
{
  "reasoning": "chain-of-thought across all dimensions",
  "scores": {
    "correctness": { "score": 1-5, "evidence": "..." },
    "completeness": { "score": 1-5, "evidence": "..." },
    "format": { "score": 1-5, "evidence": "..." },
    "clarity": { "score": 1-5, "evidence": "..." },
    "caution": { "score": 1-5, "evidence": "..." }
  },
  "overall_pass": true/false,
  "issues": ["specific problem 1", "specific problem 2"]
}
```
Key design choices:
- **Reasoning first, scores last.** Improves calibration.
- **Evidence per score.** Forces grounding, enables audit.
- **Randomize example order** when comparing two responses.
- **Temperature = 0.** Reproducibility.
- **No model-identifying hints** in the prompt (e.g., "this was generated by GPT" biases the judge).
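A minimal harness sketch putting these choices together, assuming the official `openai` Python SDK; the model name, seed, and condensed prompt text are assumptions to adapt, and the pinned model must match whatever you calibrate against:
```python
import json

from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()
JUDGE_MODEL = "gpt-4.1"  # pin the exact snapshot you calibrated against

def judge(task_prompt: str, model_response: str, dimension_anchors: str) -> dict:
    """Score one response with the rubric judge (prompt condensed for brevity)."""
    prompt = (
        "You are an expert evaluator for long-doc QA model outputs.\n"
        f"TASK PROMPT: {task_prompt}\n"
        f"MODEL RESPONSE: {model_response}\n"
        "Evaluate the MODEL RESPONSE on each dimension. For each, first explain "
        "your reasoning in 1-2 sentences, then give a score 1-5.\n"
        f"Dimensions and anchors:\n{dimension_anchors}\n"
        "Output JSON with keys: reasoning, scores, overall_pass, issues."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,                            # reproducibility
        seed=42,                                  # best-effort determinism
        response_format={"type": "json_object"},  # force parseable output
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```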
## Pairwise Variant
For comparing two models (A vs B), use pairwise preference:
```
Compare RESPONSE A and RESPONSE B. You MUST pick one:
- A_better
- B_better
- tie (only if genuinely indistinguishable)
- both_bad (only if both fail basic quality)
Explain your reasoning. Output JSON.
```
**Position bias mitigation:** for each pair, run the judge TWICE: once with (A, B), once with (B, A). Accept a verdict only if it is consistent after the swap; if the two runs disagree, record a tie.
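A sketch of the swap-and-require-consistency rule; `judge_pair()` is a hypothetical wrapper around the pairwise prompt above that returns one of the four verdicts for the argument order it is given:
```python
def pairwise_verdict(prompt: str, resp_a: str, resp_b: str) -> str:
    """Accept a pairwise verdict only if it survives a position swap.
    judge_pair() is a hypothetical wrapper that returns 'A_better',
    'B_better', 'tie', or 'both_bad' for its argument order."""
    verdict_ab = judge_pair(prompt, resp_a, resp_b)  # (A, B) order
    verdict_ba = judge_pair(prompt, resp_b, resp_a)  # (B, A) order
    swap = {"A_better": "B_better", "B_better": "A_better"}
    if verdict_ab == swap.get(verdict_ba, verdict_ba):
        return verdict_ab  # consistent under swap
    return "tie"           # runs disagreed after swapping positions
```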
## Biases to Mitigate
| Bias | Symptom | Mitigation |
|---|---|---|
| Position | Judge favors first/last response | Swap + require consistency |
| Length | Judge favors longer response | Report length-matched win rate too |
| Style | Judge favors markdown / bullet lists | Judge prompt explicitly says style is secondary to substance |
| Self-preference | Model rates itself highly | Never judge with same model |
| Authority | Judge favors confident wording | Anchor examples show confident-but-wrong as low score |
| Sycophancy | Judge agrees with first opinion | Require evidence per score |
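For the length row, a sketch of a length-matched win rate: restrict the computation to pairs whose responses are close in length (the 1.2 ratio is an assumption) and exclude undecided verdicts from the denominator:
```python
def length_matched_win_rate(pairs: list[tuple[str, int, int]],
                            max_ratio: float = 1.2) -> float:
    """Win rate for A over pairs whose response lengths are within
    max_ratio of each other. Each pair is (verdict, len_a, len_b);
    ties and both_bad are excluded from the denominator."""
    matched = [
        v for v, len_a, len_b in pairs
        if max(len_a, len_b) <= max_ratio * max(min(len_a, len_b), 1)
    ]
    decided = [v for v in matched if v in ("A_better", "B_better")]
    return decided.count("A_better") / len(decided) if decided else float("nan")
```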
## Calibration
Before trusting the judge, calibrate against human labels:
1. Collect 250 task outputs with human ratings (ideally 3+ raters per item)
2. Run the judge over the same outputs
3. Compute:
- Pearson correlation per dimension (target ≥ 0.7)
- Cohen's kappa for pairwise verdicts vs human consensus (target ≥ 0.6)
- Confusion matrix for discrete scores
4. If correlation is low, iterate on the rubric — usually the fix is sharper anchor examples, not a stronger judge model
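A calibration sketch using `scipy` and `scikit-learn`, assuming human and judge scores are aligned per item (function and variable names are illustrative):
```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def calibrate(human: dict, judge: dict) -> None:
    """human/judge map dimension name -> per-item scores on the same items
    (human scores are the consensus of 3+ raters)."""
    for dim in human:
        r, _ = pearsonr(human[dim], judge[dim])
        flag = "OK" if r >= 0.7 else "ITERATE ON RUBRIC"
        print(f"{dim}: Pearson r = {r:.2f} [{flag}]")
        print(confusion_matrix(human[dim], judge[dim], labels=[1, 2, 3, 4, 5]))

def pairwise_agreement(human_verdicts: list, judge_verdicts: list) -> float:
    """Cohen's kappa for pairwise verdicts vs human consensus (target >= 0.6)."""
    return cohen_kappa_score(human_verdicts, judge_verdicts)
```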
## Cost & Latency
- GPT-4.1 rubric scorer avg: 3s per eval, $0.03 per eval
- Eval set of 500 examples = $15 per run
- Budget: ~50 runs per week → $750/week
## Integration
### Offline Eval
- CI job runs on every merge to main
- Fails the build if any key metric drops > 2 pts vs last green
- Report posted as PR comment with before/after table
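A sketch of the CI gate, assuming each run writes aggregate per-metric scores to a JSON file (the paths and the 2-point threshold are assumptions):
```python
import json
import sys

THRESHOLD = 2.0  # allowed regression in points vs the last green run

def gate(current_path: str, baseline_path: str) -> None:
    """Fail the build (nonzero exit) if any metric regresses past THRESHOLD."""
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    failures = [
        f"{m}: {baseline[m]:.1f} -> {current[m]:.1f}"
        for m in baseline
        if baseline[m] - current.get(m, 0.0) > THRESHOLD
    ]
    if failures:
        print("Eval gate FAILED:\n" + "\n".join(failures))
        sys.exit(1)
    print("Eval gate passed.")

if __name__ == "__main__":
    gate(sys.argv[1], sys.argv[2])
```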
### Online Eval
- Shadow-score 5% of production traffic with the judge
- Aggregate by user segment, prompt type, time window
- Anomaly alerts when aggregate score drops by 2σ
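A sketch of the sampling and alert logic; `enqueue_judge()` is a hypothetical async queue so judging never blocks the live request:
```python
import random

SHADOW_RATE = 0.05  # fraction of production traffic to shadow-score

def maybe_shadow_score(trace: dict) -> None:
    """Sample traces for asynchronous judging; never block the live request."""
    if random.random() < SHADOW_RATE:
        enqueue_judge(trace)  # hypothetical queue; a worker runs the judge

def is_anomalous(latest: float, history: list[float]) -> bool:
    """True if the latest window's aggregate score drops more than 2σ
    below the historical mean."""
    mean = sum(history) / len(history)
    std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5
    return latest < mean - 2 * std
```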
## Reproducibility
Every eval run records:
- Judge model + version
- Rubric version hash
- Prompt version
- Eval set version hash
- Random seeds
- Judge latency + cost per item
Results stored with `eval_run_id` in Humanloop for drill-down.
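One way to make the record concrete: hash the versioned artifacts and keep everything in a single record keyed by `eval_run_id` (field names are illustrative):
```python
import hashlib
import json
from dataclasses import dataclass

def content_hash(obj) -> str:
    """Stable short hash of a JSON-serializable artifact (rubric, eval set)."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

@dataclass
class EvalRunRecord:
    eval_run_id: str
    judge_model: str            # pinned model + API version
    rubric_hash: str            # content_hash(RUBRIC)
    prompt_version: str
    eval_set_hash: str
    seed: int
    latency_s_per_item: float
    cost_usd_per_item: float
```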
## Golden Set Curation
Human-labeled golden set of 1000 examples, stratified by:
- Task subtype
- Difficulty (easy / medium / hard)
- Source (synthetic / real user / adversarial)
Refresh monthly to prevent overfitting. Quarantine 10% as a frozen, never-seen hold-out used only for final pre-release validation.
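A stratified-quarantine sketch using `scikit-learn`; the field names on each example are assumptions:
```python
from sklearn.model_selection import train_test_split

def split_golden_set(examples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Quarantine 10% as a frozen hold-out, stratified jointly by subtype,
    difficulty, and source so the hold-out mirrors the working set.
    Strata with a single example will raise; merge rare strata first."""
    strata = [f"{e['subtype']}|{e['difficulty']}|{e['source']}" for e in examples]
    working, frozen = train_test_split(
        examples, test_size=0.10, stratify=strata, random_state=0
    )
    return working, frozen
```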
## Deliverables
1. Rubric doc with anchor examples per dimension × score
2. Judge prompt template
3. Calibration notebook showing human-judge correlation
4. Eval harness: takes (model_fn, eval_set) → results JSON
5. CI integration snippet
6. Humanloop dashboard: aggregate scores over time, drill by dimension, pairwise win-rate matrix
Organize your output using a clear framework with labeled sections. Each section should build on the previous one.

Replace the bracketed placeholders with your own context before running the prompt:
- `[Additional task-specific dimension for long-doc QA]`: the extra dimension your long-doc QA task needs (e.g., faithfulness to cited passages).
- `[REFERENCE ANSWER (optional): {reference}]`: a reference answer, if one is available; omit the line otherwise.
- `["specific problem 1", "specific problem 2"]`: placeholder entries in the output JSON; the judge replaces them with the concrete issues it finds.