ChatGPT Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using JSON schema validation that tracks F1 score across prompt versions on Llama 3.3 70B.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Llama 3.3 70B. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from the regression suite.
2. Run any registered prompt variant against Llama 3.3 70B with pinned decoding params.
3. Score each output using JSON schema validation against a per-example ground truth or rubric.
4. Log metrics, especially F1 score, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                 ↑                                                      ↓
          Prompt registry                                     CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at }
- Sourcing plan from the regression suite
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
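
As a concrete illustration, here is a minimal sketch of one dataset record and a stdlib-only loader, assuming a JSONL file per dataset version; the path layout, field values, and the 50-example CI cap are placeholders, not requirements:

```python
import json
from pathlib import Path

# Hypothetical on-disk layout: datasets/bug_rca/v3/examples.jsonl, one record per line.
EXAMPLE_RECORD = {
    "id": "rca-0042",
    "input": "Stack trace + failing test output ...",
    "expected": {"root_cause": "race condition in connection pool", "component": "db"},
    "stratum": "concurrency",
    "tags": ["flaky-test", "p1"],
    "source_url": "https://ci.example.com/regressions/8841",
    "created_at": "2025-01-14",
    "retired_at": None,
}

def load_dataset(version: str, ci_mode: bool = False) -> list[dict]:
    """Load a pinned dataset version; in CI, keep only the small fast sample."""
    path = Path(f"datasets/bug_rca/{version}/examples.jsonl")
    examples = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    examples = [ex for ex in examples if ex.get("retired_at") is None]
    if ci_mode:
        # Assumption: the first 50 examples form the CI smoke set; a stratified sample is better.
        examples = examples[:50]
    return examples
```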
### Runner spec
- How to pin Llama 3.3 70B version (include exact version string)
- Decoding params are stored alongside the prompt, not hard-coded
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
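
A minimal sketch of the deterministic cache key and a pinned config, assuming SHA-256 over canonical JSON; the model version string and decoding values are placeholders to be replaced with whatever the serving stack actually reports:

```python
import hashlib
import json

def cache_key(prompt_text: str, example_id: str, model_version: str, decoding: dict) -> str:
    """Deterministic key: identical inputs always map to the same cached output."""
    prompt_hash = hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
    decoding_hash = hashlib.sha256(json.dumps(decoding, sort_keys=True).encode()).hexdigest()[:12]
    return f"{prompt_hash}:{example_id}:{model_version}:{decoding_hash}"

# Placeholder pinned config, stored alongside the prompt rather than hard-coded in the runner.
MODEL_VERSION = "llama-3.3-70b-instruct"
DECODING = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 1024}
```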
### Judging spec — using JSON schema validation
- Define the scoring procedure precisely.
- If an LLM judge is layered on top of the JSON schema check, pin the judge model (it must differ from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the judging procedure against a small human-labeled set; report inter-judge agreement (Cohen's κ) before trusting it.
- Flakiness mitigation: average three judge runs or use majority vote if variance is high.
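
A stdlib-only sketch of what the schema-validation judge could look like; the required fields are assumptions standing in for the team's real output contract:

```python
import json

# Assumed output contract for a root-cause analysis answer.
RCA_REQUIRED_FIELDS = {"root_cause": str, "component": str, "severity": str}

def judge(raw_output: str) -> dict:
    """Schema-validate one model output; return booleans the scorer can aggregate."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"format_compliant": False, "schema_valid": False, "parsed": None}
    schema_valid = isinstance(obj, dict) and all(
        isinstance(obj.get(field), expected) for field, expected in RCA_REQUIRED_FIELDS.items()
    )
    return {"format_compliant": True, "schema_valid": schema_valid, "parsed": obj}
```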
### Metrics
- Primary: F1 score
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
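
A sketch of the primary-metric computation, assuming the scoring step reduces each example to a set of predicted root-cause labels; macro-averaging and the specific guardrail fields shown are choices, not requirements:

```python
def f1(predicted: set[str], expected: set[str]) -> float:
    """Set-overlap F1 between predicted and expected root-cause labels for one example."""
    if not predicted and not expected:
        return 1.0
    if not predicted or not expected:
        return 0.0
    tp = len(predicted & expected)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def aggregate(rows: list[dict]) -> dict:
    """Roll per-example rows into the primary metric plus guardrails."""
    n = len(rows) or 1
    return {
        "f1": sum(r["f1"] for r in rows) / n,
        "refusal_rate": sum(r["refused"] for r in rows) / n,
        "format_compliance": sum(r["format_compliant"] for r in rows) / n,
    }
```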
### Reporting
Example report table (Markdown):
| variant | F1 score | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
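
A small sketch of a `report` helper that renders the table above from aggregated metrics; the field names mirror the aggregator sketch, and guardrails not yet wired up default to zero:

```python
def report(metrics_by_variant: dict[str, dict]) -> str:
    """Render the comparison table as Markdown."""
    lines = [
        "| variant | F1 score | refusal% | format% | p95_ms | $/ex |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for name, m in metrics_by_variant.items():
        lines.append(
            f"| {name} | {m['f1']:.3f} | {m.get('refusal_rate', 0):.1%} "
            f"| {m.get('format_compliance', 0):.1%} | {m.get('p95_latency_ms', 0):.0f} "
            f"| ${m.get('cost_per_example', 0):.4f} |"
        )
    return "\n".join(lines)
```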
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if F1 score drops >2% OR any guardrail crosses its threshold.
- Override requires explicit approver and a written justification committed to the PR.
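
A sketch of the `gate` check, reading the 2% threshold as a relative F1 drop; the guardrail limits are placeholder values to be tuned per team:

```python
GUARDRAIL_LIMITS = {  # Placeholder thresholds, not recommendations.
    "refusal_rate_max": 0.02,
    "format_compliance_min": 0.98,
}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons); block on a >2% relative F1 drop or any guardrail breach."""
    reasons = []
    if candidate["f1"] < baseline["f1"] * 0.98:
        reasons.append(f"F1 dropped {baseline['f1']:.3f} -> {candidate['f1']:.3f}")
    if candidate["refusal_rate"] > GUARDRAIL_LIMITS["refusal_rate_max"]:
        reasons.append(f"refusal_rate {candidate['refusal_rate']:.1%} over limit")
    if candidate["format_compliance"] < GUARDRAIL_LIMITS["format_compliance_min"]:
        reasons.append(f"format_compliance {candidate['format_compliance']:.1%} under limit")
    return (not reasons, reasons)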
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib + `anthropic` or `openai` client. No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
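
For orientation, a hedged outline of how those functions could compose, assuming Llama 3.3 70B sits behind an OpenAI-compatible gateway (the base URL is made up) and reusing `load_dataset`, `judge`, `f1`, `aggregate`, `report`, `gate`, `MODEL_VERSION`, and `DECODING` from the sketches above:

```python
from openai import OpenAI

# Assumption: the pinned model is served behind an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://llm-gateway.example.com/v1", api_key="...")

def run_variant(prompt_template: str, examples: list[dict]) -> list[dict]:
    """Call the pinned model once per example with the pinned decoding params."""
    rows = []
    for ex in examples:
        # Assumption: the prompt template exposes a single {input} slot.
        resp = client.chat.completions.create(
            model=MODEL_VERSION,
            messages=[{"role": "user", "content": prompt_template.format(input=ex["input"])}],
            **DECODING,
        )
        rows.append({"example": ex, "output": resp.choices[0].message.content})
    return rows

def score(verdict: dict, example: dict) -> dict:
    """Glue: turn one judge verdict plus ground truth into a per-example row."""
    predicted = {verdict["parsed"]["root_cause"]} if verdict["schema_valid"] else set()
    return {
        "f1": f1(predicted, {example["expected"]["root_cause"]}),
        "refused": False,  # Placeholder: detect refusals however the team prefers.
        "format_compliant": verdict["format_compliant"],
    }

def main(variant_path: str, baseline_metrics: dict) -> int:
    examples = load_dataset("v3", ci_mode=True)
    rows = run_variant(open(variant_path).read(), examples)
    scored = [score(judge(r["output"]), r["example"]) for r in rows]
    metrics = aggregate(scored)
    print(report({"candidate": metrics, "baseline": baseline_metrics}))
    passed, reasons = gate(metrics, baseline_metrics)
    print("\n".join(reasons))
    return 0 if passed else 1
```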
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.