ChatGPT Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using regex match checks that tracks factuality across prompt versions on Grok 3.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Grok 3. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from the regression suite.
2. Run any registered prompt variant against Grok 3 with pinned decoding params.
3. Score each output using regex match checks against a per-example ground truth or rubric.
4. Log metrics, especially factuality, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
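Responsibility 6 usually reduces to a single entrypoint that both CI and humans invoke with flags; a minimal sketch, assuming `argparse` with illustrative flag names and a hypothetical default dataset path:

```python
import argparse

def build_parser():
    """One entrypoint for both CI and ad-hoc local runs.
    Flag names and the default dataset path are illustrative, not prescribed."""
    p = argparse.ArgumentParser(description="Prompt eval harness")
    p.add_argument("--variant", required=True, help="name in the prompt registry")
    p.add_argument("--baseline", default=None, help="variant to diff against")
    p.add_argument("--ci", action="store_true",
                   help="run the small stratified CI sample instead of the full set")
    p.add_argument("--dataset", default="data/bug_rca.jsonl")
    return p
```

CI calls it with `--ci --baseline main`; a developer poking at a prompt locally omits `--ci` and gets the full set.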
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                  ↑                                          ↓
           Prompt registry                            CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at }
- Sourcing plan from the regression suite
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
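One way to realize this spec: a JSONL file of records following the schema, with retirement and CI sampling applied at load time. All record values below are invented for illustration, and the loader is a sketch, not the prescribed implementation:

```python
import json

# Illustrative record following the schema above (all values invented):
EXAMPLE = {
    "id": "bug-0042",
    "input": "Stack trace and failing test output ...",
    "expected": r"race condition in ConnectionPool\.acquire",  # regex ground truth
    "stratum": "concurrency",
    "tags": ["regression-suite", "p1"],
    "source_url": "https://ci.example.com/runs/1234",
    "created_at": "2025-01-15",
    "retired_at": None,
}

def load_dataset(path, ci_mode=False, per_stratum=5):
    """Drop retired examples; in CI, keep a small per-stratum sample."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    rows = [r for r in rows if r.get("retired_at") is None]
    if ci_mode:
        kept, counts = [], {}
        for r in rows:
            n = counts.get(r["stratum"], 0)
            if n < per_stratum:
                kept.append(r)
                counts[r["stratum"]] = n + 1
        rows = kept
    return rows
```

Keeping retirement as a field (`retired_at`) rather than deleting rows preserves history: old eval runs stay reproducible against old dataset versions.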
### Runner spec
- How to pin Grok 3 version (include exact version string)
- Decoding params stored alongside the prompt, not hard-coded in the runner
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
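The determinism requirement can be made concrete with a content-addressed cache key; a sketch (the hash truncation length is arbitrary):

```python
import hashlib, json

def cache_key(prompt_text, example_id, model_version, decoding_params):
    """Same (prompt, example, model, decoding) always maps to the same key,
    so a completed run can be reused instead of re-calling the model."""
    def h(s):
        return hashlib.sha256(s.encode()).hexdigest()[:12]
    # sort_keys makes the decoding hash insensitive to dict ordering
    return ":".join([h(prompt_text), example_id, model_version,
                     h(json.dumps(decoding_params, sort_keys=True))])
```

Note the key includes the model version string, so a silent model upgrade invalidates the cache instead of mixing outputs from two models.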
### Judging spec — using regex match checks
- Define the scoring procedure precisely.
- If any check is delegated to an LLM judge rather than a pure regex, pin the judge model (different from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the regex checks against a small human-labeled set; report agreement with the human labels (Cohen's κ) before trusting them.
- Flakiness mitigation: regex checks are deterministic, so variance comes from the model under test; if it is high, average 3 runs per example or use majority vote.
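A deterministic regex judge per this spec might look like the following; the two-list check shape (`must_match` / `must_not_match`) is one possible convention, not a requirement of the harness:

```python
import re

def judge(output, checks):
    """checks example: {"must_match": [r"race condition"],
                        "must_not_match": [r"as an AI"]}
    Passes iff every required pattern matches and no forbidden one does."""
    def hit(p):
        return re.search(p, output, re.IGNORECASE | re.DOTALL)
    return (all(hit(p) for p in checks.get("must_match", []))
            and not any(hit(p) for p in checks.get("must_not_match", [])))
```

The `must_not_match` list doubles as a cheap refusal/disclaimer detector, which feeds the guardrail metrics below.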
### Metrics
- Primary: factuality
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
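Per-stratum slices keep a regression in a small slice from being hidden by the overall mean; a sketch of the aggregation, assuming aligned lists of examples and boolean judge verdicts:

```python
from collections import defaultdict

def slice_by_stratum(examples, verdicts):
    """Map stratum -> factuality, given aligned lists of examples and
    boolean pass/fail verdicts from the judge."""
    buckets = defaultdict(list)
    for ex, ok in zip(examples, verdicts):
        buckets[ex["stratum"]].append(ok)
    return {s: sum(v) / len(v) for s, v in buckets.items()}
```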
### Reporting
Example report table (Markdown):
| variant | factuality | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
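Rendering that table is mechanical; a sketch, assuming metrics arrive as one dict per variant (the key names are assumptions to align with your metrics store):

```python
def report(results):
    """results: {variant: {"factuality", "refusal_pct", "format_pct",
    "p95_ms", "cost_per_ex"}} -> Markdown comparison table."""
    lines = ["| variant | factuality | refusal% | format% | p95_ms | $/ex |",
             "| --- | --- | --- | --- | --- | --- |"]
    for name, m in sorted(results.items()):
        lines.append("| {} | {:.3f} | {:.1f} | {:.1f} | {:.0f} | {:.4f} |".format(
            name, m["factuality"], m["refusal_pct"], m["format_pct"],
            m["p95_ms"], m["cost_per_ex"]))
    return "\n".join(lines)
```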
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if factuality drops >2% OR any guardrail crosses its threshold.
- Override requires explicit approver and a written justification committed to the PR.
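The gating rule, read as a 2-point absolute factuality drop (an assumption; the design doc should state absolute vs. relative explicitly), might look like:

```python
def gate(baseline, candidate, guardrail_limits, max_drop=0.02):
    """Return (passed, reasons). Fails on a factuality drop beyond
    max_drop (absolute) or any guardrail crossing its limit."""
    reasons = []
    if baseline["factuality"] - candidate["factuality"] > max_drop:
        reasons.append("factuality dropped more than {:.0%}".format(max_drop))
    for metric, limit in guardrail_limits.items():
        if candidate.get(metric, 0) > limit:
            reasons.append("{} = {} exceeds limit {}".format(
                metric, candidate[metric], limit))
    return (not reasons, reasons)
```

Returning the reasons, not just a boolean, gives the PR a human-readable failure message and gives the override approver something concrete to justify against.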
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib plus the `anthropic` or `openai` client. No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
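For calibration, a compressed sketch of what such a skeleton might look like. The model string `grok-3`, the `https://api.x.ai/v1` base URL, and environment-variable auth are assumptions to verify against xAI's docs; xAI's API is OpenAI-compatible, which is why the `openai` client appears here:

```python
import json, re, statistics

def load_dataset(path):
    """Active (non-retired) examples from the versioned JSONL dataset."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [r for r in rows if r.get("retired_at") is None]

def run_variant(prompt_template, examples, model="grok-3",
                base_url="https://api.x.ai/v1", **decoding):
    from openai import OpenAI           # OpenAI-compatible client
    client = OpenAI(base_url=base_url)  # expects the API key in the environment
    outs = {}
    for ex in examples:
        resp = client.chat.completions.create(
            model=model, temperature=0.0,
            messages=[{"role": "user",
                       "content": prompt_template.format(input=ex["input"])}],
            **decoding)
        outs[ex["id"]] = resp.choices[0].message.content
    return outs

def judge(output, expected):
    """Deterministic regex check against per-example ground truth."""
    return bool(re.search(expected, output, re.IGNORECASE))

def score(examples, outs):
    verdicts = [judge(outs[ex["id"]], ex["expected"]) for ex in examples]
    return {"factuality": statistics.mean(verdicts)}

def report(name, m):
    return "| {} | {:.3f} |".format(name, m["factuality"])

def gate(baseline, candidate, max_drop=0.02):
    """Pass unless factuality drops more than max_drop (absolute)."""
    return baseline["factuality"] - candidate["factuality"] <= max_drop
```

Guardrail metrics, caching, and the full report table are deliberately elided here; the design doc should fill them in per the sections above.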
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.