ChatGPT Prompt for Prompt Optimization & Evals
Run a rigorous A/B test on prompt variants for scientific literature review, measuring factuality on o1-mini using promptfoo assertions.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are running a prompt A/B test for scientific literature review on o1-mini. The goal is a defensible, shippable decision — not a vibes-based "B felt better".

## Setup

- **Variant A**: the current production prompt.
- **Variant B**: the proposed challenger.
- **Task**: scientific literature review
- **Model**: o1-mini
- **Primary metric**: factuality
- **Judge**: promptfoo assertions
- **Dataset source**: production logs

## Your job, step by step

### Step 1 — Pre-register the test

Before running anything, write down:

- Hypothesis: "B will beat A on factuality by at least X%, without regressing the guardrail metrics."
- Sample size: how many examples (justify using the variance of prior runs if available; otherwise default to 300).
- Guardrail metrics (must NOT get worse): refusal rate, format-compliance rate, safety-policy violations, p95 latency.
- Decision rule: ship B if (lift on factuality ≥ X%) AND (no guardrail regression beyond Y%) AND (the effect is significant at p < 0.05 via paired bootstrap).

### Step 2 — Build the eval dataset

Sample from production logs. Stratify to cover:

- Easy / medium / hard scientific literature review inputs
- Known edge cases (empty input, very long input, inputs with injected instructions)
- Historical regressions (inputs that previously broke the system)

Tag each example with its stratum so you can slice results.

### Step 3 — Run both variants

- Same inputs to both.
- Same decoding params.
- Same judging pipeline.
- Capture: output, reasoning (if applicable), latency, token counts, tool calls.

### Step 4 — Score with promptfoo assertions

- For LLM-as-judge, use pairwise preference with randomized order to avoid position bias.
- For exact-match / regex checks, use strict matching.
- For rubric scoring, publish the rubric, use two independent judges, and report inter-judge agreement (Cohen's κ).

### Step 5 — Analyze

Produce a report with:

1. Overall lift on factuality with a 95% CI (paired bootstrap).
2. Per-stratum lift (easy/medium/hard) — is the win uniform or concentrated?
3. Guardrail metric deltas.
4. Cost delta (tokens × $/token for o1-mini).
5. Latency delta (p50, p95).
6. A table of the 10 most-different outputs (where A and B disagreed most), with your take on which is better and why.

### Step 6 — Write the decision memo (1 page)

- Decision: SHIP / HOLD / REDO
- Why
- What would change your mind (explicit)
- What you'd test next

## Anti-patterns to avoid

- Cherry-picking examples to make B look good.
- Running B with different decoding params "to be fair".
- Declaring victory on a 2% lift with N=30.
- Ignoring cost or latency regressions because "accuracy is what matters".
- Forgetting the guardrails.

Deliverable: the full report in Markdown, tables rendered cleanly, no emoji, no marketing tone.
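The paired bootstrap referenced in the pre-registered decision rule can be sketched in a few lines of standard-library Python. This is a minimal illustration, not part of any eval framework: the function name and return shape are made up for this example, and per-example scores are assumed to come from whatever judge you configured.

```python
import random

def paired_bootstrap_lift(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Lift of variant B over A with a 95% CI via paired bootstrap.

    scores_a / scores_b: per-example metric scores for the SAME inputs,
    in the same order (the pairing is what makes the test paired).
    Returns (mean_lift, ci_low, ci_high, p_value).
    """
    assert len(scores_a) == len(scores_b) and scores_a
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample per-example differences with replacement, keeping pairs intact.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    mean_lift = sum(diffs) / n
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples)]
    # Two-sided p-value: share of resampled means on the far side of zero.
    tail = min(sum(m <= 0 for m in means), sum(m >= 0 for m in means))
    p_value = min(1.0, 2 * tail / n_resamples)
    return mean_lift, ci_low, ci_high, p_value
```

Ship B only if the whole CI clears your pre-registered lift threshold, not just the point estimate; resampling differences (rather than A and B independently) is what keeps per-example difficulty from inflating the variance.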
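Step 4 asks for inter-judge agreement via Cohen's κ when two independent judges score the same rubric. A minimal sketch using only the standard library (the function name and label values are illustrative; κ of 1 is perfect agreement, 0 is chance-level):

```python
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two judges labelling the same examples.

    labels_1 / labels_2: parallel lists of categorical verdicts,
    e.g. "A", "B", "tie" from a pairwise-preference judge.
    """
    assert len(labels_1) == len(labels_2) and labels_1
    n = len(labels_1)
    # Observed agreement: fraction of examples where the judges match.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Expected agreement by chance, from each judge's marginal label rates.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(
        (counts_1[c] / n) * (counts_2[c] / n)
        for c in counts_1.keys() | counts_2.keys()
    )
    if expected == 1.0:  # both judges emitted one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

If κ is low, fix the rubric before trusting the scores: disagreement between judges usually means the rubric is ambiguous, not that one judge is wrong.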