# Claude Prompt for Prompt Optimization & Evals
Token-cost and latency reduction playbook for a sales lead qualification prompt running on Claude 3.5 Sonnet, judged by exact match.
You are doing a cost and latency pass on a sales lead qualification prompt deployed on Claude 3.5 Sonnet. The team wants to cut mean tokens per request by ~30% without losing accuracy. Produce a concrete reduction plan — not hand-waving.

## Starting point

- Current prompt: (the team will paste it in)
- Current mean tokens per request: (the team will give a number)
- Accuracy floor: the current accuracy on the eval set is the line we must not drop below
- Judge: exact match

## Reduction levers, in order of expected ROI for Claude 3.5 Sonnet

For each lever, give: (1) what to try, (2) why it usually helps on Claude 3.5 Sonnet, (3) how to measure safely, (4) the risk.

### Lever 1 — Shrink the system prompt

Audit the system prompt line by line. Mark each line:

- KEEP (load-bearing; behavior changes if removed)
- MERGE (can be combined with another line)
- DROP (redundant, vague, or unmeasurably helpful)

Run a "remove-line ablation": drop one candidate line, re-eval on 200 examples, and keep the drop if accuracy holds.

### Lever 2 — Move static context out of every request

Anything that never changes between calls (tool list, schema definitions, style guide) should be:

- Cached via prompt caching (Anthropic supports this on Claude 3.5 Sonnet; OpenAI offers an equivalent with different semantics)
- Or moved to a fine-tune / custom-instruction layer if the deployment supports it

### Lever 3 — Shorten few-shots

- Drop few-shots that aren't pulling their weight (measure by removing each one and checking per-example deltas).
- Compress each remaining exemplar: shorter inputs, tighter outputs, no filler commentary.

### Lever 4 — Tighten the output contract

If the output is free-form prose, switch to a structured format with a tight schema. Less of the token budget goes to politeness, hedging, and restating the question.

### Lever 5 — Reasoning-token budget

If Claude 3.5 Sonnet supports extended thinking / reasoning tokens, set an explicit budget.
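Setting that budget can be sketched as a request payload. This is a minimal sketch assuming an Anthropic-style `thinking` parameter with `budget_tokens` — extended thinking shipped on later Claude models, so verify that your exact deployed model supports it before relying on this; the model id and token values below are placeholders.

```python
# Sketch: cap reasoning tokens with an explicit budget.
# ASSUMPTION: an Anthropic-style "thinking" parameter with "budget_tokens";
# check your model's docs -- not every Claude model exposes extended thinking.

def build_request(user_msg: str, thinking_budget: int = 1024) -> dict:
    """Build a messages-API payload with a capped reasoning-token budget."""
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model id
        "max_tokens": 512,                    # tight output cap; pairs well with Lever 4
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = build_request("Qualify this lead: ...", thinking_budget=512)
```

Sweep the budget downward on the eval set and stop at the smallest value that still clears the accuracy floor.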
The default is often too generous for sales lead qualification.

### Lever 6 — Consider downshifting

For easy strata of sales lead qualification, route to a cheaper model (Claude 3.5 Sonnet's smaller sibling, e.g. Claude 3.5 Haiku). Only hard strata go to the full Claude 3.5 Sonnet. Add a router prompt plus a classifier check.

### Lever 7 — Batch and stream

- Batch calls where latency allows (batch APIs typically discount token pricing, lowering cost per request).
- Stream tokens back to the user so perceived latency drops even if raw latency doesn't.

## Deliverable

A reduction plan with this shape:

```
## Current state
- Mean tokens per request: X
- Accuracy: Y

## Plan
1. <lever> — expected delta, how measured, risk.
2. ...
3. ...

## Eval gates
- Accuracy drop allowed: <= 1% absolute
- Format compliance: must stay >= current
- Safety violations: 0

## Rollout
- Feature-flagged at 5% traffic
- Monitor for 48h
- Ramp to 100% if gates hold
```

## Constraints

- Do not "optimize" by silently removing safety rules.
- Do not chase mean tokens per request into regressions on rare but important strata — measure per-stratum.
- Document every lever you tried AND rejected — that's valuable institutional memory.
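The eval gates above can be enforced mechanically before any rollout decision. A minimal sketch — the function name and metric arguments are illustrative, but the thresholds mirror the gates listed in the deliverable:

```python
# Sketch: mechanical check of the three eval gates before ramping traffic.
# Thresholds mirror the deliverable; names here are illustrative.

def gates_pass(
    baseline_acc: float,
    candidate_acc: float,
    baseline_fmt: float,
    candidate_fmt: float,
    safety_violations: int,
) -> bool:
    """Return True only if every rollout gate holds."""
    acc_ok = (baseline_acc - candidate_acc) <= 0.01   # <= 1% absolute accuracy drop
    fmt_ok = candidate_fmt >= baseline_fmt            # format compliance must not regress
    safe_ok = safety_violations == 0                  # zero tolerance
    return acc_ok and fmt_ok and safe_ok

# 0.5% accuracy drop, format holds, no safety hits -> gates hold
assert gates_pass(0.910, 0.905, 0.99, 0.99, 0) is True
# 2% accuracy drop fails the first gate
assert gates_pass(0.910, 0.890, 0.99, 0.99, 0) is False
```

Wire this into the feature-flag ramp so a failing gate blocks the move from 5% to 100% automatically rather than relying on someone reading a dashboard.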