Claude Prompt for Prompt Optimization & Evals
Use manual grid search over temperature+system to optimize a funnel analysis prompt on GPT-4.1 for inter-judge agreement without regressing safety.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are running an automated prompt-optimization campaign using manual grid search over temperature+system to improve inter-judge agreement for a funnel analysis prompt on GPT-4.1.

## Inputs you have

- A current prompt (system + user template) that the team is shipping.
- A train set of 500 labeled funnel analysis examples.
- A held-out dev set of 200 examples (never used by the optimizer).
- A held-out test set of 200 examples (opened exactly once at the end).
- A judge based on regex match checks with a documented rubric.
- A per-example ground truth OR per-example rubric score.

## Your job

### Phase 1 — Baselines

Measure the current prompt on the dev set. Report inter-judge agreement and the guardrail metrics. This is your floor. Any optimizer output that can't beat this floor on the dev set is a failure.

### Phase 2 — Configure manual grid search over temperature+system

Document:

- Search space (what does manual grid search over temperature+system vary? Instructions, few-shots, format hints, decoding params?)
- Budget (number of candidate prompts, number of eval runs per candidate, total tokens, total wallclock)
- Stopping rule (e.g., no improvement for N rounds)
- Seed handling
- Safety constraint: any candidate that increases the safety-violation rate is auto-rejected even if it boosts inter-judge agreement.

### Phase 3 — Run

Run manual grid search over temperature+system. Log every candidate prompt with its score, cost, and any guardrail violations. Keep the top-K.

### Phase 4 — Human review

Take the top 3 candidate prompts and read them by hand. Reject any that:

- Hard-code dev-set patterns into the prompt (overfitting via instructions).
- Remove or weaken safety language.
- Exploit judge-specific shortcuts (e.g., adding fluff tokens the regex-match check happens to reward).
- Are incomprehensible (even if they score well — unmaintainable prompts are a liability).

### Phase 5 — Final eval

Run the surviving candidate(s) on the never-touched test set. Report:

- Inter-judge agreement delta vs. baseline (with CI)
- Guardrail deltas
- Cost & latency deltas
- Qualitative delta vs. baseline on 10 sampled outputs

### Phase 6 — Ship / hold / rethink

Write a one-page decision. Include what you'd do differently next round, what the optimizer revealed about the task, and whether the improvement was instruction-level or few-shot-level.

## Constraints

- Never use the test set during optimization. Never.
- Never let the optimizer edit the safety section of the system prompt.
- Never accept a candidate that the on-call engineer can't explain in one sentence.
- Report negative results. Failed campaigns are still knowledge.

Output the full campaign report in Markdown.
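
To make Phases 2–3 concrete, here is a minimal sketch of the grid-search loop with the safety auto-reject and top-K bookkeeping. Everything in it is an assumption for illustration: `run_prompt`, `judge_score`, `violates_safety`, and the `SYSTEM_VARIANTS` grid are hypothetical placeholders you would replace with your own model call, judge, and candidate prompts; they are not part of any real API.

```python
import random

# --- Hypothetical stand-ins; swap in your real model call and judge -----
def run_prompt(system: str, temperature: float, example: dict, seed: int) -> str:
    """Call the target model once and return its answer text (assumed helper)."""
    raise NotImplementedError

def judge_score(answer: str, example: dict) -> float:
    """Regex-match judge per the documented rubric: 1.0 = agreement, 0.0 = not."""
    raise NotImplementedError

def violates_safety(answer: str) -> bool:
    """Guardrail check: True if the answer trips a safety rule."""
    raise NotImplementedError

# --- Search space: system-prompt variants x decoding temperature --------
SYSTEM_VARIANTS = {
    "baseline": "You analyze funnels step by step...",  # the shipping prompt
    "terse":    "Answer with the funnel table only...",  # candidate edit
    "rubric":   "Follow the grading rubric exactly...",  # candidate edit
}
TEMPERATURES = [0.0, 0.3, 0.7]
SEED = 17
TOP_K = 3

def evaluate(system: str, temperature: float, dev_set: list[dict]) -> dict:
    """Score one candidate on the dev set; the test set is never touched here."""
    rng = random.Random(SEED)
    scores, violations = [], 0
    for ex in dev_set:
        answer = run_prompt(system, temperature, ex, seed=rng.randint(0, 2**31))
        scores.append(judge_score(answer, ex))
        violations += violates_safety(answer)
    return {
        "agreement": sum(scores) / len(scores),
        "violation_rate": violations / len(dev_set),
    }

def grid_search(dev_set: list[dict], baseline_metrics: dict) -> list[dict]:
    """Run every (system, temperature) cell, auto-reject guardrail regressions,
    and keep the top-K candidates by dev-set inter-judge agreement."""
    results = []
    for name, system in SYSTEM_VARIANTS.items():
        for temp in TEMPERATURES:
            metrics = evaluate(system, temp, dev_set)
            metrics.update(candidate=name, temperature=temp)
            # Safety constraint: reject any candidate that regresses guardrails,
            # even if it improves agreement.
            if metrics["violation_rate"] > baseline_metrics["violation_rate"]:
                metrics["rejected"] = "safety regression"
            results.append(metrics)  # log every candidate, rejected or not
    kept = [r for r in results if "rejected" not in r]
    return sorted(kept, key=lambda r: r["agreement"], reverse=True)[:TOP_K]
```

The only non-obvious design choice is logging rejected candidates alongside kept ones, which matches the constraint to report negative results.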
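
Phase 5 asks for the agreement delta "with CI". One way to get it without distributional assumptions is a paired bootstrap over per-example judge scores. The sketch below assumes you already have aligned per-example scores for the baseline and the surviving candidate on the test set; the function name and defaults are illustrative, not prescribed by the campaign.

```python
import random

def paired_bootstrap_delta_ci(baseline_scores: list[float],
                              candidate_scores: list[float],
                              n_resamples: int = 10_000,
                              alpha: float = 0.05,
                              seed: int = 17) -> tuple[float, float, float]:
    """Return (observed delta, CI low, CI high) for candidate - baseline.

    Both score lists must be aligned per test-set example (paired design).
    """
    assert len(baseline_scores) == len(candidate_scores)
    rng = random.Random(seed)
    n = len(baseline_scores)
    observed = (sum(candidate_scores) - sum(baseline_scores)) / n

    deltas = []
    for _ in range(n_resamples):
        # Resample test-set examples with replacement, keeping pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(candidate_scores[i] - baseline_scores[i] for i in idx) / n)

    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi
```

Because the 200-example test set is opened exactly once, this computation runs a single time per surviving candidate and its interval goes directly into the Phase 5 report.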