ChatGPT Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using JSON schema validation that tracks F1 score across prompt versions on Llama 3.3 70B.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Llama 3.3 70B. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from the regression suite.
2. Run any registered prompt variant against Llama 3.3 70B with pinned decoding params.
3. Score each output using JSON schema validation against a per-example ground truth or rubric.
4. Log metrics, especially F1 score, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                 ↑                                                      ↓
          Prompt registry                                     CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at }
- Sourcing plan from the regression suite
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
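
As a concrete illustration, here is a minimal sketch of one dataset record and a stdlib-only loader, assuming a JSONL file per dataset version; the path layout, field values, and the 50-example CI cap are placeholders, not requirements:

```python
import json
from pathlib import Path

# Hypothetical on-disk layout: datasets/bug_rca/v3/examples.jsonl, one record per line.
EXAMPLE_RECORD = {
    "id": "rca-0042",
    "input": "Stack trace + failing test output ...",
    "expected": {"root_cause": "race condition in connection pool", "component": "db"},
    "stratum": "concurrency",
    "tags": ["flaky-test", "p1"],
    "source_url": "https://ci.example.com/regressions/8841",
    "created_at": "2025-01-14",
    "retired_at": None,
}

def load_dataset(version: str, ci_mode: bool = False) -> list[dict]:
    """Load a pinned dataset version; in CI, keep only the small fast sample."""
    path = Path(f"datasets/bug_rca/{version}/examples.jsonl")
    examples = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    examples = [ex for ex in examples if ex.get("retired_at") is None]
    if ci_mode:
        # Assumption: the first 50 examples form the CI smoke set; a stratified sample is better.
        examples = examples[:50]
    return examples
```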
### Runner spec
- How to pin Llama 3.3 70B version (include exact version string)
- Decoding params are stored alongside the prompt, not hard-coded
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
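
A minimal sketch of the deterministic cache key and a pinned config, assuming SHA-256 over canonical JSON; the model version string and decoding values are placeholders to be replaced with whatever the serving stack actually reports:

```python
import hashlib
import json

def cache_key(prompt_text: str, example_id: str, model_version: str, decoding: dict) -> str:
    """Deterministic key: identical inputs always map to the same cached output."""
    prompt_hash = hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
    decoding_hash = hashlib.sha256(json.dumps(decoding, sort_keys=True).encode()).hexdigest()[:12]
    return f"{prompt_hash}:{example_id}:{model_version}:{decoding_hash}"

# Placeholder pinned config, stored alongside the prompt rather than hard-coded in the runner.
MODEL_VERSION = "llama-3.3-70b-instruct"
DECODING = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 1024}
```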
### Judging spec — using JSON schema validation
- Define the scoring procedure precisely.
- If an LLM judge is layered on top of the JSON schema check, pin the judge model (it must differ from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the judging procedure against a small human-labeled set; report inter-judge agreement (Cohen's κ) before trusting it.
- Flakiness mitigation: average three judge runs or use majority vote if variance is high.
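
A stdlib-only sketch of what the schema-validation judge could look like; the required fields are assumptions standing in for the team's real output contract:

```python
import json

# Assumed output contract for a root-cause analysis answer.
RCA_REQUIRED_FIELDS = {"root_cause": str, "component": str, "severity": str}

def judge(raw_output: str) -> dict:
    """Schema-validate one model output; return booleans the scorer can aggregate."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"format_compliant": False, "schema_valid": False, "parsed": None}
    schema_valid = isinstance(obj, dict) and all(
        isinstance(obj.get(field), expected) for field, expected in RCA_REQUIRED_FIELDS.items()
    )
    return {"format_compliant": True, "schema_valid": schema_valid, "parsed": obj}
```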
### Metrics
- Primary: F1 score
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
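
A sketch of the primary-metric computation, assuming the scoring step reduces each example to a set of predicted root-cause labels; macro-averaging and the specific guardrail fields shown are choices, not requirements:

```python
def f1(predicted: set[str], expected: set[str]) -> float:
    """Set-overlap F1 between predicted and expected root-cause labels for one example."""
    if not predicted and not expected:
        return 1.0
    if not predicted or not expected:
        return 0.0
    tp = len(predicted & expected)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def aggregate(rows: list[dict]) -> dict:
    """Roll per-example rows into the primary metric plus guardrails."""
    n = len(rows) or 1
    return {
        "f1": sum(r["f1"] for r in rows) / n,
        "refusal_rate": sum(r["refused"] for r in rows) / n,
        "format_compliance": sum(r["format_compliant"] for r in rows) / n,
    }
```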
### Reporting
Example report table (Markdown):
| variant | F1 score | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
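
A small sketch of a `report` helper that renders the table above from aggregated metrics; the field names mirror the aggregator sketch, and guardrails not yet wired up default to zero:

```python
def report(metrics_by_variant: dict[str, dict]) -> str:
    """Render the comparison table as Markdown."""
    lines = [
        "| variant | F1 score | refusal% | format% | p95_ms | $/ex |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for name, m in metrics_by_variant.items():
        lines.append(
            f"| {name} | {m['f1']:.3f} | {m.get('refusal_rate', 0):.1%} "
            f"| {m.get('format_compliance', 0):.1%} | {m.get('p95_latency_ms', 0):.0f} "
            f"| ${m.get('cost_per_example', 0):.4f} |"
        )
    return "\n".join(lines)
```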
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if F1 score drops >2% OR any guardrail crosses its threshold.
- Override requires explicit approver and a written justification committed to the PR.
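
A sketch of the `gate` check, reading the 2% threshold as a relative F1 drop; the guardrail limits are placeholder values to be tuned per team:

```python
GUARDRAIL_LIMITS = {  # Placeholder thresholds, not recommendations.
    "refusal_rate_max": 0.02,
    "format_compliance_min": 0.98,
}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons); block on a >2% relative F1 drop or any guardrail breach."""
    reasons = []
    if candidate["f1"] < baseline["f1"] * 0.98:
        reasons.append(f"F1 dropped {baseline['f1']:.3f} -> {candidate['f1']:.3f}")
    if candidate["refusal_rate"] > GUARDRAIL_LIMITS["refusal_rate_max"]:
        reasons.append(f"refusal_rate {candidate['refusal_rate']:.1%} over limit")
    if candidate["format_compliance"] < GUARDRAIL_LIMITS["format_compliance_min"]:
        reasons.append(f"format_compliance {candidate['format_compliance']:.1%} under limit")
    return (not reasons, reasons)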
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib + `anthropic` or `openai` client. No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
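
For orientation, a hedged outline of how those functions could compose, assuming Llama 3.3 70B sits behind an OpenAI-compatible gateway (the base URL is made up) and reusing `load_dataset`, `judge`, `f1`, `aggregate`, `report`, `gate`, `MODEL_VERSION`, and `DECODING` from the sketches above:

```python
from openai import OpenAI

# Assumption: the pinned model is served behind an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://llm-gateway.example.com/v1", api_key="...")

def run_variant(prompt_template: str, examples: list[dict]) -> list[dict]:
    """Call the pinned model once per example with the pinned decoding params."""
    rows = []
    for ex in examples:
        # Assumption: the prompt template exposes a single {input} slot.
        resp = client.chat.completions.create(
            model=MODEL_VERSION,
            messages=[{"role": "user", "content": prompt_template.format(input=ex["input"])}],
            **DECODING,
        )
        rows.append({"example": ex, "output": resp.choices[0].message.content})
    return rows

def score(verdict: dict, example: dict) -> dict:
    """Glue: turn one judge verdict plus ground truth into a per-example row."""
    predicted = {verdict["parsed"]["root_cause"]} if verdict["schema_valid"] else set()
    return {
        "f1": f1(predicted, {example["expected"]["root_cause"]}),
        "refused": False,  # Placeholder: detect refusals however the team prefers.
        "format_compliant": verdict["format_compliant"],
    }

def main(variant_path: str, baseline_metrics: dict) -> int:
    examples = load_dataset("v3", ci_mode=True)
    rows = run_variant(open(variant_path).read(), examples)
    scored = [score(judge(r["output"]), r["example"]) for r in rows]
    metrics = aggregate(scored)
    print(report({"candidate": metrics, "baseline": baseline_metrics}))
    passed, reasons = gate(metrics, baseline_metrics)
    print("\n".join(reasons))
    return 0 if passed else 1
```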
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.