ChatGPT Prompt for Prompt Optimization & Evals
Use manual grid search over temperature+system to optimize a customer support routing prompt on o3-mini against user satisfaction (CSAT) without regressing safety.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are running an automated prompt-optimization campaign using manual grid search over temperature+system to improve user satisfaction (CSAT) for a customer support routing prompt on o3-mini.

## Inputs you have

- A current prompt (system + user template) that the team is shipping.
- A train set of 1000 labeled customer support routing examples.
- A held-out dev set of 200 examples (never used by the optimizer).
- A held-out test set of 100 examples (opened exactly once at the end).
- A judge based on semantic similarity with a documented rubric.
- A per-example ground truth OR per-example rubric score.

## Your job

### Phase 1 — Baselines

Measure the current prompt on the dev set. Report user satisfaction (CSAT) and the guardrail metrics. This is your floor. Any optimizer output that can't beat this floor on the dev set is a failure.

### Phase 2 — Configure manual grid search over temperature+system

Document:

- Search space (what does manual grid search over temperature+system vary? instructions, few-shots, format hints, decoding params?)
- Budget (number of candidate prompts, number of eval runs per candidate, total tokens, total wallclock)
- Stopping rule (e.g., no improvement for N rounds)
- Seed handling
- Safety constraint: any candidate that increases safety-violation rate is auto-rejected even if it boosts user satisfaction (CSAT).

### Phase 3 — Run

Run manual grid search over temperature+system. Log every candidate prompt with its score, cost, and any guardrail violations. Keep the top-K.

### Phase 4 — Human review

Take the top 3 candidate prompts and read them by hand. Reject any that:

- Hard-code dev-set patterns into the prompt (overfitting via instructions).
- Remove or weaken safety language.
- Exploit judge-specific shortcuts (e.g., adding fluff tokens the semantic-similarity judge happens to reward).
- Are incomprehensible (even if they score well — unmaintainable prompts are a liability).
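The search-and-reject loop in Phases 2–3 can be sketched as a few lines of Python. This is a minimal illustration, not a harness: `csat_score` and `safety_violation_rate` are hypothetical stubs standing in for your real dev-set judge and guardrail checker, and the baseline numbers are assumed Phase 1 measurements.

```python
import itertools

# Hypothetical stubs -- replace with your real dev-set judge and safety checker.
def csat_score(system_prompt: str, temperature: float) -> float:
    # Stub: pretend a terse routing prompt at low temperature scores best.
    return 0.70 + (0.1 if "route" in system_prompt.lower() else 0.0) - 0.05 * temperature

def safety_violation_rate(system_prompt: str, temperature: float) -> float:
    # Stub: pretend higher temperature produces more violations.
    return 0.01 + 0.02 * temperature

BASELINE_CSAT = 0.72        # Phase 1 floor, measured on the dev set (assumed)
BASELINE_VIOLATIONS = 0.02  # guardrail floor (assumed)

SYSTEM_VARIANTS = [
    "You are a support router. Route each ticket to exactly one queue.",
    "You are a support router. Route each ticket to one queue and explain briefly.",
]
TEMPERATURES = [0.0, 0.3, 0.7]

results = []
for system, temp in itertools.product(SYSTEM_VARIANTS, TEMPERATURES):
    csat = csat_score(system, temp)
    violations = safety_violation_rate(system, temp)
    # Safety constraint: auto-reject any candidate that regresses the
    # guardrail, no matter how much it boosts CSAT.
    if violations > BASELINE_VIOLATIONS:
        continue
    if csat <= BASELINE_CSAT:
        continue  # must beat the Phase 1 floor on the dev set
    results.append({"system": system, "temperature": temp, "csat": csat})

# Keep the top-K (here K=3) for Phase 4 human review.
top_k = sorted(results, key=lambda r: r["csat"], reverse=True)[:3]
for r in top_k:
    print(f"csat={r['csat']:.3f} temp={r['temperature']} system={r['system'][:45]!r}")
```

Note that every logged candidate carries its score and guardrail result, so the Phase 3 log doubles as the audit trail for Phase 4.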
### Phase 5 — Final eval

Run the surviving candidate(s) on the never-touched test set. Report:

- User satisfaction (CSAT) delta vs. baseline (with CI)
- Guardrail deltas
- Cost & latency deltas
- Qualitative delta vs. baseline on 10 sampled outputs

### Phase 6 — Ship / hold / rethink

Write a one-page decision. Include what you'd do differently next round, what the optimizer revealed about the task, and whether the improvement was instruction-level or few-shot-level.

## Constraints

- Never use the test set during optimization. Never.
- Never let the optimizer edit the safety section of the system prompt.
- Never accept a candidate that the on-call engineer can't explain in one sentence.
- Report negative results. Failed campaigns are still knowledge.

Output the full campaign report in Markdown.
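One way to produce the "CSAT delta vs. baseline (with CI)" from Phase 5 is a paired percentile bootstrap over test-set examples. The sketch below uses synthetic 0/1 per-example satisfaction scores (the real ones would come from your judge); the pairing assumes both prompts were scored on the same 100 test examples.

```python
import random

random.seed(0)  # fixed seed so the resampling is reproducible

# Synthetic per-example CSAT outcomes (1 = satisfied) on the 100-example
# test set; in a real campaign these come from the judge, in order.
baseline  = [1] * 70 + [0] * 30   # baseline prompt: 70% satisfied
candidate = [1] * 78 + [0] * 22   # surviving candidate: 78% satisfied

def bootstrap_delta_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling paired examples."""
    n = len(a)
    deltas = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        deltas.append(sum(b[i] for i in idx) / n - sum(a[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

point = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)
lo, hi = bootstrap_delta_ci(baseline, candidate)
print(f"CSAT delta = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

If the CI straddles zero, the campaign report should say so plainly: that is exactly the "report negative results" constraint in action.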