ChatGPT Prompt for Evals & Observability
Design A/B rollout analysis and drift detection for PII leakage rate on a production LLM app doing search + answer over docs.
You are a data scientist / ML engineer shipping a new prompt (or model, or retrieval config) for search + answer over docs. Design the rollout plan that lets you ship safely AND detect drift over time after ship.

## Change Under Test

Examples: prompt v2.3 → v2.4, model Sonnet → Haiku, chunk size 512 → 1024, reranker on/off. For this spec, the change is: adding query transformation.

- Primary metric: PII leakage rate (lower is better)
- Secondary metrics: tool-call success rate
- Guardrails (must not regress): latency p95, refusal rate, error rate, cost

## Pre-Launch: Offline Eval

Run the new variant against:

1. Golden set (1000 cases) — must not regress
2. Historical production sample (1000 recent requests) — pairwise judge scoring
3. Adversarial / red-team set

Pass criteria before any online traffic:

- Offline pass rate ≥ control
- Pairwise win rate ≥ 50% (ties allowed) under a chain-of-thought rubric judge
- No safety regressions
- Latency p95 within +10% of control
- Cost within budget

If any criterion fails → stop, investigate, do not proceed to online.
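The pre-launch gate above can be encoded as a single check that blocks the rollout on any failed criterion. This is a minimal sketch; the metric names and the `offline_gate` helper are illustrative, not a real eval-harness API:

```python
def offline_gate(control: dict, treatment: dict, budget_cost: float):
    """Evaluate the pre-launch pass criteria (illustrative metric names).

    Returns (passed, failures): passed is True only when every criterion
    holds; failures lists each breached criterion so the investigation
    step knows where to start.
    """
    failures = []
    if treatment["pass_rate"] < control["pass_rate"]:
        failures.append("offline pass rate regressed vs control")
    if treatment["pairwise_win_rate"] < 0.5:
        failures.append("pairwise win rate below 50%")
    if treatment["safety_violations"] > control["safety_violations"]:
        failures.append("safety regression")
    if treatment["latency_p95"] > 1.10 * control["latency_p95"]:
        failures.append("latency p95 more than +10% over control")
    if treatment["cost_per_request"] > budget_cost:
        failures.append("cost per request over budget")
    return (not failures, failures)
```

A failed gate should halt the pipeline rather than merely warn, matching the "do not proceed to online" rule.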
## Rollout Stages

### Stage 1: Shadow (0% user-visible)

- Both control and treatment run on 100% of traffic
- Treatment output is logged, NOT shown to the user
- Compare outputs pairwise offline
- Duration: 5 days or 10k requests (whichever comes first)
- Gate to Stage 2: pairwise win rate ≥ 50%, guardrails clean

### Stage 2: Canary (1-5% user-visible)

- Random 5% of sessions → treatment
- Hash-based assignment on `user_id` for consistency (same user, same variant)
- Metrics collected: all online metrics + explicit feedback (thumbs, retry rate)
- Duration: 3 days or 20k requests per arm
- Gate to Stage 3: primary metric at least as good as control (non-overlapping CIs OR p < 0.05), guardrails clean

### Stage 3: Ramp (10% → 25% → 50%)

- Each step: 3 days; check metrics at each step
- Halt the ramp if any guardrail breaches

### Stage 4: Full Rollout (100%) + Monitor

- Keep the control variant on a 5% holdout for N weeks to measure long-term drift vs baseline
- After N weeks, retire the control

## Statistical Analysis

### Sample Size

Primary metric: PII leakage rate. To detect an effect size δ = 1% (absolute) at α = 0.05 and power = 0.8:

- Per arm: 20k requests
- At current traffic of 200k/day split 50/50 → ~14 days

### Hypothesis Test

- H0: treatment_metric = control_metric
- H1: treatment_metric > control_metric (one-sided for improvements; two-sided for neutral changes)
- Tests: Welch's t-test for continuous metrics, chi-square for proportions, bootstrap for judge scores
- Multiple-comparisons correction: Bonferroni when testing primary + 5 secondary → α / 6 per test

### CUPED Variance Reduction

If you have a pre-experiment metric per user, use CUPED to reduce variance by 20-40%, cutting the required sample size proportionally.
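The CUPED adjustment above is small enough to sketch directly. This is a minimal stdlib-only version, assuming one pre-experiment covariate value `x_i` per user alongside the in-experiment metric `y_i`:

```python
from statistics import mean

def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """CUPED: residualize metric y on a pre-experiment covariate x.

    theta = cov(x, y) / var(x); adjusted_i = y_i - theta * (x_i - mean(x)).
    The adjusted metric keeps the same mean as y, but its variance shrinks
    in proportion to how strongly x and y are correlated.
    """
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]
```

Run the usual t-test on the adjusted values; because the mean is unchanged, the effect estimate is unbiased while the confidence intervals tighten.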
### Decision Rules

- Ship if: primary metric improves significantly AND no guardrail regression
- Kill if: any guardrail regresses OR primary metric is significantly worse
- Iterate if: primary metric is neutral AND secondary signals are mixed

## Drift Detection (Post-Ship)

Even after shipping, monitor for drift — the world changes:

- Users ask new question types
- Retrieved documents rotate
- The model provider silently updates
- Seasonal shifts

### Signals to Monitor

1. **Input drift:** distribution of prompt embeddings → PSI or KL divergence vs a baseline week
2. **Output drift:** distribution of response length, format compliance rate, refusal rate
3. **Quality drift:** rolling judge score on sampled production traffic
4. **User drift:** thumbs-down rate, retry rate, session-length changes
5. **Cost drift:** cost per request creeping up

### Metrics

- **PSI (Population Stability Index)** on input embedding clusters
  - PSI < 0.1: no drift
  - PSI 0.1-0.25: moderate, investigate
  - PSI > 0.25: significant, alert
- **KS test** on latency / cost distributions week-over-week
- **Judge score 7d rolling average** with 2σ bands

### Alerts

- PSI > 0.25 for 3 consecutive days → Slack the team
- Judge score 7d average drops > 5% vs the 30d average → page on-call
- Refusal rate up > 20% week-over-week → Slack the team
- Top-10 most common prompt patterns change significantly → Slack the team

### Drift Response Playbook

When a drift alert fires:

1. Snapshot the current input distribution
2. Compare to baseline — what's new or different?
3. Sample 20 prompts from the drifted cluster; eyeball them for a new intent class
4. Decide: update the golden set? update the retrieval corpus? adjust the prompt? investigate an upstream change?
5. Document in an incident log

## Observability in Langfuse

- Dashboard: per-variant metrics side-by-side with CIs
- Decision log: every rollout-stage decision with timestamp + metrics snapshot
- Drift dashboard: PSI / KL / judge-score trends
- Retention: 180 days for metrics, 30 days for raw samples

## Deliverables

1. Rollout plan doc (stages, gates, statistical criteria)
2. Sample-size + power calculation notebook
3. Langfuse dashboards (variant comparison, drift)
4. Alert rules (thresholds + paging policy)
5. Drift playbook
6. Post-mortem template for kills/incidents

Structure as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.
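As a concrete reference for the PSI thresholds in the drift-metrics section, this is a minimal sketch of the index over binned counts (e.g. per embedding cluster); the `eps` guard for empty bins is a common convention, not part of the PSI definition itself:

```python
import math

def psi(expected_counts: list[int], actual_counts: list[int],
        eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline (expected) and a
    current (actual) distribution over the same bins/clusters.

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i) over bin proportions.
    Thresholds from the spec: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift (alert).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)  # clamp to avoid log(0) on empty bins
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total
```

Compute it daily against a fixed baseline week of traffic; the "3 consecutive days above 0.25" alert rule then reduces to a rolling check on this value.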