Claude Prompt for Evals & Observability
Instrument, query, and triage search-and-answer LLM app traces in OpenTelemetry + Jaeger with Java SDK, covering latency, cost, and quality dashboards.
You are the observability lead for an LLM-powered product. Build the complete tracing + analysis stack in OpenTelemetry + Jaeger that lets anyone on the team answer "why was this request slow/expensive/wrong?" in under 60 seconds.
## Trace Model
Every user interaction is a trace. Within a trace:
- **Spans:** discrete steps (LLM call, retrieval, tool call, DB query)
- **Events:** instant occurrences (cache hit, retry, error)
- **Attributes:** key-value metadata on spans
### Span Naming Convention
Use `noun.verb` names, for example:
- `llm.completion` — a single LLM API call
- `retrieval.search` — a vector search call
- `rerank.score` — rerank API call
- `tool.execute` — agent tool invocation
- `prompt.render` — template filling
- `validator.check` — schema validation
### Required Attributes (every LLM span)
- `model` — exact model id, e.g., "claude-sonnet-4-5-20251001"
- `prompt.template_name` + `prompt.template_version`
- `prompt.hash` — content hash of the exact prompt sent
- `input.tokens`, `output.tokens`, `cache.read_tokens`, `cache.creation_tokens`
- `cost.usd` (computed from tokens × model price; see the sketch after this list)
- `latency.ms`
- `user.id` (hashed, not raw PII)
- `session.id`, `request.id`
- `error.type` (if any), `error.message`
- `feedback.thumbs` (if user rated later, backfill)
- `quality.judge_score` (if offline-scored)
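A minimal sketch of `compute_cost`, assuming a hand-maintained price table; the model id and per-million-token prices below are illustrative, so verify them against your provider's current price sheet:
```python
# Illustrative prices in USD per million tokens (verify against the provider's price sheet).
PRICES = {
    "claude-sonnet-4-5-20251001": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
}

def compute_cost(model: str, input_tokens: int, output_tokens: int,
                 cache_read_tokens: int = 0) -> float:
    """Derive the cost.usd attribute from token counts and the price table."""
    p = PRICES[model]
    return (input_tokens * p["input"]
            + output_tokens * p["output"]
            + cache_read_tokens * p["cache_read"]) / 1_000_000
```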
### PII Handling
- Do NOT log raw prompts/completions by default in production traces
- Instead log hashes + sampled raw (e.g., 1% of traffic → raw, 99% → hash only)
- Redact emails, phones, SSNs with a regex pre-processor (sketched after this list)
- User-consented debug mode can enable raw logging per session
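A minimal redaction pre-processor, assuming US-style phone and SSN formats; the patterns are illustrative and will both miss some formats and over-match others:
```python
import re

# Illustrative patterns; tune for your actual traffic.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<ssn>"),
    (re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<phone>"),
]

def redact(text: str) -> str:
    """Run every pattern over the text before it is logged or hashed."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```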
## Instrumentation
### Python (OpenTelemetry + Jaeger)
```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

async def complete(prompt: str, model: str):
    # start_as_current_span opens the llm.completion span and makes it the active context
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("model", model)
        span.set_attribute("prompt.hash", sha256(prompt))
        response = await client.messages.create(...)
        span.set_attribute("input.tokens", response.usage.input_tokens)
        span.set_attribute("output.tokens", response.usage.output_tokens)
        span.set_attribute("cost.usd", compute_cost(...))
        return response
```
### TypeScript (OpenTelemetry + Jaeger)
Use the equivalent wrapper via `tracer.startActiveSpan()` from `@opentelemetry/api`. Context propagates across `await` boundaries when the SDK is configured with an async-local-storage context manager (e.g., `AsyncLocalStorageContextManager` from `@opentelemetry/context-async-hooks`).
## Dashboards
### Dashboard 1: Latency
- Distribution of end-to-end trace latency (p50, p95, p99)
- Breakdown by span type (which step is slow?)
- Top-10 slowest traces in last 24h (click to drill in; a query sketch follows this list)
- Correlation: latency vs input_tokens, latency vs model, latency vs cache_hit
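A sketch of pulling those slow traces programmatically, assuming the Jaeger query service is reachable on its default port 16686. This hits Jaeger's unversioned internal HTTP API (the one its own UI calls), so treat the endpoint and parameters as assumptions to verify against your Jaeger version:
```python
import requests

# Unversioned Jaeger search endpoint; parameters verified only against recent Jaeger UIs.
resp = requests.get(
    "http://localhost:16686/api/traces",
    params={
        "service": "llm-app",   # assumed service name
        "lookback": "24h",
        "minDuration": "3s",    # mirror the p95 alert threshold
        "limit": 10,
    },
    timeout=10,
)
slow_traces = resp.json()["data"]
```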
### Dashboard 2: Cost
- Cost per user cohort per day
- Cost per feature / endpoint
- Cost breakdown by model (are we paying premium for tasks that could use cheaper tier?)
- Cache savings: total tokens served from cache vs fresh
- Anomaly detection: flag sudden spikes (a simple detector is sketched after this list)
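One way to flag spikes, as a sketch: compare today's spend to a trailing baseline and alert past a z-score threshold. The window length and threshold are illustrative:
```python
from statistics import mean, stdev

def is_cost_spike(daily_costs: list[float], z_threshold: float = 3.0) -> bool:
    """Flag the last day's cost if it sits more than z_threshold std devs above the baseline."""
    *baseline, today = daily_costs
    if len(baseline) < 7:
        return False  # not enough history for a stable baseline
    sigma = stdev(baseline) or 1e-9  # guard against division by zero on flat spend
    return (today - mean(baseline)) / sigma > z_threshold
```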
### Dashboard 3: Quality
- Judge score rolling average
- Pass rate on regression suite (per release)
- User thumbs (thumbs_up / total) over time
- Refusal rate (target range alert if drifts)
- Distribution of confidence scores
### Dashboard 4: Errors
- Error rate per endpoint
- Top error messages (grouped by normalized message; see the sketch after this list)
- Rate-limit events
- Repair/retry rate (structured output)
- Tool-call validation failures
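Grouping raw messages directly fragments the counts, because messages embed request ids and numbers. A minimal normalization sketch; the patterns are assumptions about what your messages contain:
```python
import re
from collections import Counter

def normalize_error(msg: str) -> str:
    """Collapse volatile fragments so retries of the same failure group together."""
    msg = re.sub(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", "<uuid>", msg)
    return re.sub(r"\d+", "<n>", msg)

def top_errors(messages: list[str], n: int = 10) -> list[tuple[str, int]]:
    return Counter(normalize_error(m) for m in messages).most_common(n)
```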
## Common Triage Queries
### "Why is this request slow?"
Open the trace. Look at the Gantt view. Identify the longest span. Check if it's:
- LLM call → check input tokens, model, provider status
- Retrieval → check index size, filter selectivity
- Tool call → check downstream service latency
- Sequential where it could be parallel → opportunity to fix (see the sketch after this list)
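The Gantt view makes the last case obvious: independent child spans laid end to end instead of overlapping. A sketch of the fix, assuming hypothetical `search()` and `fetch_user_profile()` coroutines that each open their own span:
```python
import asyncio

async def gather_context(query: str, user_id: str):
    # Running independent steps concurrently makes their spans overlap in the
    # Gantt view instead of stacking end to end.
    docs, profile = await asyncio.gather(
        search(query),                # hypothetical retrieval.search coroutine
        fetch_user_profile(user_id),  # hypothetical DB-query coroutine
    )
    return docs, profile
```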
### "Why is cost up 30% this week?"
Query: cost by model × day (a groupby sketch follows this list). Often one of:
- Traffic growth (check request count)
- Prompt length creep (check avg input_tokens over time)
- Cache hit rate dropped (check cache.hit ratio)
- Tier mix shift (more traffic to expensive model)
- New feature launched with expensive prompts (filter by endpoint × release_tag)
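Jaeger doesn't aggregate across traces, so this query assumes spans are exported somewhere tabular (a warehouse, or a DataFrame in a notebook). A pandas sketch over a hypothetical export schema with `ts`, `model`, and `cost_usd` columns:
```python
import pandas as pd

def cost_by_model_day(spans: pd.DataFrame) -> pd.DataFrame:
    """spans: one row per LLM span with ts (timestamp), model, cost_usd."""
    return (
        spans.groupby([pd.Grouper(key="ts", freq="D"), "model"])["cost_usd"]
        .sum()
        .unstack("model")  # one column per model, one row per day
    )
```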
### "Why did quality regress?"
Query: judge_score by prompt_template_version. Bisect to the change that introduced the drop. Pull 10 sample regressions for manual review.
## Alerts
Jaeger has no native alert engine, so derive metrics from spans (e.g., with the OpenTelemetry Collector's spanmetrics connector) and wire these rules into your metrics/alerting stack:
- `latency_p95 > 3s` for 10 min → page on-call
- `error_rate > 2%` for 5 min → page on-call
- `cost_per_day > $X` → Slack finance channel
- `judge_score_7d_avg drops > 5%` → Slack team
- `repair_rate > 3%` → Slack team
## Sampling Strategy
- Head-based sampling: 10% of normal traffic; error status isn't known at the head, so keeping 100% of error traces requires tail-based sampling (e.g., the OTel Collector's `tail_sampling` processor)
- Keep raw prompts for 1% of traffic + all errors + all low-quality-scored
- Retention: 30 days hot, 180 days cold archive
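A sketch of the head-sampling half in the Python SDK; the 10% ratio mirrors the policy above, while the keep-all-errors half has to live in the Collector rather than the SDK:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces at the root; child spans follow their parent's decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
```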
## Golden-Set Replay
Automate weekly replay of the golden set (a job sketch follows this list):
- Trigger: Sunday 2am UTC
- Run each golden example through production code path
- Auto-score with judge
- Post summary to Slack: score delta vs last week, top 5 regressions
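A minimal sketch of the replay job, assuming hypothetical `load_golden_set()`, `production_pipeline()`, and `judge()` helpers plus a Slack incoming-webhook URL:
```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # assumed incoming-webhook URL

async def weekly_replay():
    golden = load_golden_set()  # hypothetical golden-set loader
    scores = []
    for example in golden:
        output = await production_pipeline(example.input)  # same code path as production
        scores.append(judge(example, output))              # hypothetical LLM-as-judge scorer
    avg = sum(scores) / len(scores)
    requests.post(SLACK_WEBHOOK, json={"text": f"Golden-set replay: avg judge score {avg:.3f}"})
```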
## Deliverables
1. Instrumentation library wrappers for each LLM provider
2. OpenTelemetry + Jaeger dashboards (exportable JSON)
3. Alert rules
4. Triage playbook with queries pinned in OpenTelemetry + Jaeger
5. Weekly report automation
6. Onboarding doc: "how to debug an LLM request in OpenTelemetry + Jaeger"
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.