ChatGPT Prompt for Evals & Observability
Design A/B rollout analysis and drift detection for PII leakage rate on a production LLM app doing search + answer over docs.
You are a data scientist / ML engineer shipping a new prompt (or model, or retrieval config) for search + answer over docs. Design the rollout plan that lets you ship safely AND detect drift over time after ship.

## Change Under Test

Examples: prompt v2.3 → v2.4, model Sonnet → Haiku, chunk size 512 → 1024, reranker on/off. For this spec, the change is: adding query transformation.

- Primary metric: PII leakage rate (lower is better)
- Secondary metrics: tool-call success rate
- Guardrails (must not regress): latency p95, refusal rate, error rate, cost

## Pre-Launch: Offline Eval

Run the new variant against:

1. Golden set (1000 cases) — must not regress
2. Historical production sample (1000 recent requests) — pairwise judge scoring
3. Adversarial / red-team set

Pass criteria before any online traffic:

- Offline pass rate ≥ control
- Pairwise win rate ≥ 50% (ties allowed) under a chain-of-thought rubric judge
- No safety regressions
- Latency p95 within +10% of control
- Cost within budget

If any criterion fails → stop, investigate, do not proceed to online.
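The pre-launch gate above can be encoded as a single check that blocks the rollout on any failed criterion. This is a minimal sketch; the metric names and the `offline_gate` helper are illustrative, not a real eval-harness API:

```python
def offline_gate(control: dict, treatment: dict, budget_cost: float):
    """Evaluate the pre-launch pass criteria (illustrative metric names).

    Returns (passed, failures): passed is True only when every criterion
    holds; failures lists each breached criterion so the investigation
    step knows where to start.
    """
    failures = []
    if treatment["pass_rate"] < control["pass_rate"]:
        failures.append("offline pass rate regressed vs control")
    if treatment["pairwise_win_rate"] < 0.5:
        failures.append("pairwise win rate below 50%")
    if treatment["safety_violations"] > control["safety_violations"]:
        failures.append("safety regression")
    if treatment["latency_p95"] > 1.10 * control["latency_p95"]:
        failures.append("latency p95 more than +10% over control")
    if treatment["cost_per_request"] > budget_cost:
        failures.append("cost per request over budget")
    return (not failures, failures)
```

A failed gate should halt the pipeline rather than merely warn, matching the "do not proceed to online" rule.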
## Rollout Stages

### Stage 1: Shadow (0% user-visible)

- Both control and treatment run on 100% of traffic
- Treatment output is logged, NOT shown to the user
- Compare outputs pairwise offline
- Duration: 5 days or 10k requests (whichever comes first)
- Gate to Stage 2: pairwise win rate ≥ 50%, guardrails clean

### Stage 2: Canary (1-5% user-visible)

- Random 5% of sessions → treatment
- Hash-based assignment on `user_id` for consistency (same user, same variant)
- Metrics collected: all online metrics + explicit feedback (thumbs, retry rate)
- Duration: 3 days or 20k requests per arm
- Gate to Stage 3: primary metric at least as good as control (non-overlapping CIs OR p < 0.05), guardrails clean

### Stage 3: Ramp (10% → 25% → 50%)

- Each step: 3 days; check metrics at each step
- Halt the ramp if any guardrail breaches

### Stage 4: Full Rollout (100%) + Monitor

- Keep the control variant on a 5% holdout for N weeks to measure long-term drift vs baseline
- After N weeks, retire the control

## Statistical Analysis

### Sample Size

Primary metric: PII leakage rate. To detect an effect size δ = 1% (absolute) at α = 0.05 and power = 0.8:

- Per arm: 20k requests
- At current traffic of 200k/day split 50/50 → ~14 days

### Hypothesis Test

- H0: treatment_metric = control_metric
- H1: treatment_metric > control_metric (one-sided for improvements; two-sided for neutral changes)
- Tests: Welch's t-test for continuous metrics, chi-square for proportions, bootstrap for judge scores
- Multiple-comparisons correction: Bonferroni when testing primary + 5 secondary → α / 6 per test

### CUPED Variance Reduction

If you have a pre-experiment metric per user, use CUPED to reduce variance by 20-40%, cutting the required sample size proportionally.
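The CUPED adjustment above is small enough to sketch directly. This is a minimal stdlib-only version, assuming one pre-experiment covariate value `x_i` per user alongside the in-experiment metric `y_i`:

```python
from statistics import mean

def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """CUPED: residualize metric y on a pre-experiment covariate x.

    theta = cov(x, y) / var(x); adjusted_i = y_i - theta * (x_i - mean(x)).
    The adjusted metric keeps the same mean as y, but its variance shrinks
    in proportion to how strongly x and y are correlated.
    """
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]
```

Run the usual t-test on the adjusted values; because the mean is unchanged, the effect estimate is unbiased while the confidence intervals tighten.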
### Decision Rules

- Ship if: primary metric improves significantly AND no guardrail regression
- Kill if: any guardrail regresses OR primary metric is significantly worse
- Iterate if: primary metric is neutral AND secondary signals are mixed

## Drift Detection (Post-Ship)

Even after shipping, monitor for drift — the world changes:

- Users ask new question types
- Retrieved documents rotate
- The model provider silently updates
- Seasonal shifts

### Signals to Monitor

1. **Input drift:** distribution of prompt embeddings → PSI or KL divergence vs a baseline week
2. **Output drift:** distribution of response length, format compliance rate, refusal rate
3. **Quality drift:** rolling judge score on sampled production traffic
4. **User drift:** thumbs-down rate, retry rate, session-length changes
5. **Cost drift:** cost per request creeping up

### Metrics

- **PSI (Population Stability Index)** on input embedding clusters
  - PSI < 0.1: no drift
  - PSI 0.1-0.25: moderate, investigate
  - PSI > 0.25: significant, alert
- **KS test** on latency / cost distributions week-over-week
- **Judge score 7d rolling average** with 2σ bands

### Alerts

- PSI > 0.25 for 3 consecutive days → Slack the team
- Judge score 7d average drops > 5% vs the 30d average → page on-call
- Refusal rate up > 20% week-over-week → Slack the team
- Top-10 most common prompt patterns change significantly → Slack the team

### Drift Response Playbook

When a drift alert fires:

1. Snapshot the current input distribution
2. Compare to baseline — what's new or different?
3. Sample 20 prompts from the drifted cluster; eyeball them for a new intent class
4. Decide: update the golden set? update the retrieval corpus? adjust the prompt? investigate an upstream change?
5. Document in an incident log

## Observability in Langfuse

- Dashboard: per-variant metrics side-by-side with CIs
- Decision log: every rollout-stage decision with timestamp + metrics snapshot
- Drift dashboard: PSI / KL / judge-score trends
- Retention: 180 days for metrics, 30 days for raw samples

## Deliverables

1. Rollout plan doc (stages, gates, statistical criteria)
2. Sample-size + power calculation notebook
3. Langfuse dashboards (variant comparison, drift)
4. Alert rules (thresholds + paging policy)
5. Drift playbook
6. Post-mortem template for kills/incidents

Structure as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.
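As a concrete reference for the PSI thresholds in the drift-metrics section, this is a minimal sketch of the index over binned counts (e.g. per embedding cluster); the `eps` guard for empty bins is a common convention, not part of the PSI definition itself:

```python
import math

def psi(expected_counts: list[int], actual_counts: list[int],
        eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline (expected) and a
    current (actual) distribution over the same bins/clusters.

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i) over bin proportions.
    Thresholds from the spec: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift (alert).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)  # clamp to avoid log(0) on empty bins
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total
```

Compute it daily against a fixed baseline week of traffic; the "3 consecutive days above 0.25" alert rule then reduces to a rolling check on this value.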