ChatGPT Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using regex match checks that tracks factuality across prompt versions on Grok 3.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Grok 3. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from the regression suite.
2. Run any registered prompt variant against Grok 3 with pinned decoding params.
3. Score each output using regex match checks against a per-example ground truth or rubric.
4. Log metrics, especially factuality, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
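Responsibility 6 usually reduces to a single entrypoint that both CI and humans invoke with flags; a minimal sketch, assuming `argparse` with illustrative flag names and a hypothetical default dataset path:

```python
import argparse

def build_parser():
    """One entrypoint for both CI and ad-hoc local runs.
    Flag names and the default dataset path are illustrative, not prescribed."""
    p = argparse.ArgumentParser(description="Prompt eval harness")
    p.add_argument("--variant", required=True, help="name in the prompt registry")
    p.add_argument("--baseline", default=None, help="variant to diff against")
    p.add_argument("--ci", action="store_true",
                   help="run the small stratified CI sample instead of the full set")
    p.add_argument("--dataset", default="data/bug_rca.jsonl")
    return p
```

CI calls it with `--ci --baseline main`; a developer poking at a prompt locally omits `--ci` and gets the full set.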
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                  ↑                                          ↓
           Prompt registry                            CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at }
- Sourcing plan from the regression suite
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
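One way to realize this spec: a JSONL file of records following the schema, with retirement and CI sampling applied at load time. All record values below are invented for illustration, and the loader is a sketch, not the prescribed implementation:

```python
import json

# Illustrative record following the schema above (all values invented):
EXAMPLE = {
    "id": "bug-0042",
    "input": "Stack trace and failing test output ...",
    "expected": r"race condition in ConnectionPool\.acquire",  # regex ground truth
    "stratum": "concurrency",
    "tags": ["regression-suite", "p1"],
    "source_url": "https://ci.example.com/runs/1234",
    "created_at": "2025-01-15",
    "retired_at": None,
}

def load_dataset(path, ci_mode=False, per_stratum=5):
    """Drop retired examples; in CI, keep a small per-stratum sample."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    rows = [r for r in rows if r.get("retired_at") is None]
    if ci_mode:
        kept, counts = [], {}
        for r in rows:
            n = counts.get(r["stratum"], 0)
            if n < per_stratum:
                kept.append(r)
                counts[r["stratum"]] = n + 1
        rows = kept
    return rows
```

Keeping retirement as a field (`retired_at`) rather than deleting rows preserves history: old eval runs stay reproducible against old dataset versions.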
### Runner spec
- How to pin Grok 3 version (include exact version string)
- Decoding params stored alongside the prompt, not hard-coded in the runner
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
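The determinism requirement can be made concrete with a content-addressed cache key; a sketch (the hash truncation length is arbitrary):

```python
import hashlib, json

def cache_key(prompt_text, example_id, model_version, decoding_params):
    """Same (prompt, example, model, decoding) always maps to the same key,
    so a completed run can be reused instead of re-calling the model."""
    def h(s):
        return hashlib.sha256(s.encode()).hexdigest()[:12]
    # sort_keys makes the decoding hash insensitive to dict ordering
    return ":".join([h(prompt_text), example_id, model_version,
                     h(json.dumps(decoding_params, sort_keys=True))])
```

Note the key includes the model version string, so a silent model upgrade invalidates the cache instead of mixing outputs from two models.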
### Judging spec — using regex match checks
- Define the scoring procedure precisely.
- If any check is delegated to an LLM judge rather than a pure regex, pin the judge model (different from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the regex checks against a small human-labeled set; report agreement with the human labels (Cohen's κ) before trusting them.
- Flakiness mitigation: regex checks are deterministic, so variance comes from the model under test; if it is high, average 3 runs per example or use majority vote.
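A deterministic regex judge per this spec might look like the following; the two-list check shape (`must_match` / `must_not_match`) is one possible convention, not a requirement of the harness:

```python
import re

def judge(output, checks):
    """checks example: {"must_match": [r"race condition"],
                        "must_not_match": [r"as an AI"]}
    Passes iff every required pattern matches and no forbidden one does."""
    def hit(p):
        return re.search(p, output, re.IGNORECASE | re.DOTALL)
    return (all(hit(p) for p in checks.get("must_match", []))
            and not any(hit(p) for p in checks.get("must_not_match", [])))
```

The `must_not_match` list doubles as a cheap refusal/disclaimer detector, which feeds the guardrail metrics below.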
### Metrics
- Primary: factuality
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
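Per-stratum slices keep a regression in a small slice from being hidden by the overall mean; a sketch of the aggregation, assuming aligned lists of examples and boolean judge verdicts:

```python
from collections import defaultdict

def slice_by_stratum(examples, verdicts):
    """Map stratum -> factuality, given aligned lists of examples and
    boolean pass/fail verdicts from the judge."""
    buckets = defaultdict(list)
    for ex, ok in zip(examples, verdicts):
        buckets[ex["stratum"]].append(ok)
    return {s: sum(v) / len(v) for s, v in buckets.items()}
```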
### Reporting
Example report table (Markdown):
| variant | factuality | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
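Rendering that table is mechanical; a sketch, assuming metrics arrive as one dict per variant (the key names are assumptions to align with your metrics store):

```python
def report(results):
    """results: {variant: {"factuality", "refusal_pct", "format_pct",
    "p95_ms", "cost_per_ex"}} -> Markdown comparison table."""
    lines = ["| variant | factuality | refusal% | format% | p95_ms | $/ex |",
             "| --- | --- | --- | --- | --- | --- |"]
    for name, m in sorted(results.items()):
        lines.append("| {} | {:.3f} | {:.1f} | {:.1f} | {:.0f} | {:.4f} |".format(
            name, m["factuality"], m["refusal_pct"], m["format_pct"],
            m["p95_ms"], m["cost_per_ex"]))
    return "\n".join(lines)
```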
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if factuality drops >2% OR any guardrail crosses its threshold.
- Override requires explicit approver and a written justification committed to the PR.
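The gating rule, read as a 2-point absolute factuality drop (an assumption; the design doc should state absolute vs. relative explicitly), might look like:

```python
def gate(baseline, candidate, guardrail_limits, max_drop=0.02):
    """Return (passed, reasons). Fails on a factuality drop beyond
    max_drop (absolute) or any guardrail crossing its limit."""
    reasons = []
    if baseline["factuality"] - candidate["factuality"] > max_drop:
        reasons.append("factuality dropped more than {:.0%}".format(max_drop))
    for metric, limit in guardrail_limits.items():
        if candidate.get(metric, 0) > limit:
            reasons.append("{} = {} exceeds limit {}".format(
                metric, candidate[metric], limit))
    return (not reasons, reasons)
```

Returning the reasons, not just a boolean, gives the PR a human-readable failure message and gives the override approver something concrete to justify against.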
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib plus the `anthropic` or `openai` client. No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
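For calibration, a compressed sketch of what such a skeleton might look like. The model string `grok-3`, the `https://api.x.ai/v1` base URL, and environment-variable auth are assumptions to verify against xAI's docs; xAI's API is OpenAI-compatible, which is why the `openai` client appears here:

```python
import json, re, statistics

def load_dataset(path):
    """Active (non-retired) examples from the versioned JSONL dataset."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [r for r in rows if r.get("retired_at") is None]

def run_variant(prompt_template, examples, model="grok-3",
                base_url="https://api.x.ai/v1", **decoding):
    from openai import OpenAI           # OpenAI-compatible client
    client = OpenAI(base_url=base_url)  # expects the API key in the environment
    outs = {}
    for ex in examples:
        resp = client.chat.completions.create(
            model=model, temperature=0.0,
            messages=[{"role": "user",
                       "content": prompt_template.format(input=ex["input"])}],
            **decoding)
        outs[ex["id"]] = resp.choices[0].message.content
    return outs

def judge(output, expected):
    """Deterministic regex check against per-example ground truth."""
    return bool(re.search(expected, output, re.IGNORECASE))

def score(examples, outs):
    verdicts = [judge(outs[ex["id"]], ex["expected"]) for ex in examples]
    return {"factuality": statistics.mean(verdicts)}

def report(name, m):
    return "| {} | {:.3f} |".format(name, m["factuality"])

def gate(baseline, candidate, max_drop=0.02):
    """Pass unless factuality drops more than max_drop (absolute)."""
    return baseline["factuality"] - candidate["factuality"] <= max_drop
```

Guardrail metrics, caching, and the full report table are deliberately elided here; the design doc should fill them in per the sections above.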
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.