ChatGPT Prompt for Prompt Optimization & Evals
Use manual grid search over temperature+system to optimize a customer support routing prompt on o3-mini against user satisfaction (CSAT) without regressing safety.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are running an automated prompt-optimization campaign using manual grid search over temperature+system to improve user satisfaction (CSAT) for a customer support routing prompt on o3-mini.

## Inputs you have

- A current prompt (system + user template) that the team is shipping.
- A train set of 1000 labeled customer support routing examples.
- A held-out dev set of 200 examples (never used by the optimizer).
- A held-out test set of 100 examples (opened exactly once at the end).
- A judge based on semantic similarity with a documented rubric.
- A per-example ground truth OR per-example rubric score.

## Your job

### Phase 1 — Baselines

Measure the current prompt on the dev set. Report user satisfaction (CSAT) and the guardrail metrics. This is your floor. Any optimizer output that can't beat this floor on the dev set is a failure.

### Phase 2 — Configure manual grid search over temperature+system

Document:

- Search space (what does manual grid search over temperature+system vary? instructions, few-shots, format hints, decoding params?)
- Budget (number of candidate prompts, number of eval runs per candidate, total tokens, total wallclock)
- Stopping rule (e.g., no improvement for N rounds)
- Seed handling
- Safety constraint: any candidate that increases safety-violation rate is auto-rejected even if it boosts user satisfaction (CSAT).

### Phase 3 — Run

Run manual grid search over temperature+system. Log every candidate prompt with its score, cost, and any guardrail violations. Keep the top-K.

### Phase 4 — Human review

Take the top 3 candidate prompts and read them by hand. Reject any that:

- Hard-code dev-set patterns into the prompt (overfitting via instructions).
- Remove or weaken safety language.
- Exploit judge-specific shortcuts (e.g., adding fluff tokens the semantic-similarity judge happens to reward).
- Are incomprehensible (even if they score well — unmaintainable prompts are a liability).
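The search-and-reject loop in Phases 2–3 can be sketched as a few lines of Python. This is a minimal illustration, not a harness: `csat_score` and `safety_violation_rate` are hypothetical stubs standing in for your real dev-set judge and guardrail checker, and the baseline numbers are assumed Phase 1 measurements.

```python
import itertools

# Hypothetical stubs -- replace with your real dev-set judge and safety checker.
def csat_score(system_prompt: str, temperature: float) -> float:
    # Stub: pretend a terse routing prompt at low temperature scores best.
    return 0.70 + (0.1 if "route" in system_prompt.lower() else 0.0) - 0.05 * temperature

def safety_violation_rate(system_prompt: str, temperature: float) -> float:
    # Stub: pretend higher temperature produces more violations.
    return 0.01 + 0.02 * temperature

BASELINE_CSAT = 0.72        # Phase 1 floor, measured on the dev set (assumed)
BASELINE_VIOLATIONS = 0.02  # guardrail floor (assumed)

SYSTEM_VARIANTS = [
    "You are a support router. Route each ticket to exactly one queue.",
    "You are a support router. Route each ticket to one queue and explain briefly.",
]
TEMPERATURES = [0.0, 0.3, 0.7]

results = []
for system, temp in itertools.product(SYSTEM_VARIANTS, TEMPERATURES):
    csat = csat_score(system, temp)
    violations = safety_violation_rate(system, temp)
    # Safety constraint: auto-reject any candidate that regresses the
    # guardrail, no matter how much it boosts CSAT.
    if violations > BASELINE_VIOLATIONS:
        continue
    if csat <= BASELINE_CSAT:
        continue  # must beat the Phase 1 floor on the dev set
    results.append({"system": system, "temperature": temp, "csat": csat})

# Keep the top-K (here K=3) for Phase 4 human review.
top_k = sorted(results, key=lambda r: r["csat"], reverse=True)[:3]
for r in top_k:
    print(f"csat={r['csat']:.3f} temp={r['temperature']} system={r['system'][:45]!r}")
```

Note that every logged candidate carries its score and guardrail result, so the Phase 3 log doubles as the audit trail for Phase 4.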
### Phase 5 — Final eval

Run the surviving candidate(s) on the never-touched test set. Report:

- User satisfaction (CSAT) delta vs. baseline (with CI)
- Guardrail deltas
- Cost & latency deltas
- Qualitative delta vs. baseline on 10 sampled outputs

### Phase 6 — Ship / hold / rethink

Write a one-page decision. Include what you'd do differently next round, what the optimizer revealed about the task, and whether the improvement was instruction-level or few-shot-level.

## Constraints

- Never use the test set during optimization. Never.
- Never let the optimizer edit the safety section of the system prompt.
- Never accept a candidate that the on-call engineer can't explain in one sentence.
- Report negative results. Failed campaigns are still knowledge.

Output the full campaign report in Markdown.
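One way to produce the "CSAT delta vs. baseline (with CI)" from Phase 5 is a paired percentile bootstrap over test-set examples. The sketch below uses synthetic 0/1 per-example satisfaction scores (the real ones would come from your judge); the pairing assumes both prompts were scored on the same 100 test examples.

```python
import random

random.seed(0)  # fixed seed so the resampling is reproducible

# Synthetic per-example CSAT outcomes (1 = satisfied) on the 100-example
# test set; in a real campaign these come from the judge, in order.
baseline  = [1] * 70 + [0] * 30   # baseline prompt: 70% satisfied
candidate = [1] * 78 + [0] * 22   # surviving candidate: 78% satisfied

def bootstrap_delta_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling paired examples."""
    n = len(a)
    deltas = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        deltas.append(sum(b[i] for i in idx) / n - sum(a[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

point = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)
lo, hi = bootstrap_delta_ci(baseline, candidate)
print(f"CSAT delta = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

If the CI straddles zero, the campaign report should say so plainly: that is exactly the "report negative results" constraint in action.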