Instrument, query, and triage tool-use agent LLM app traces in Phoenix (Arize) with the Go SDK, covering latency, cost, and quality dashboards.
Instrument, query, and triage content moderation LLM app traces in Phoenix (Arize) with the Ruby SDK, covering latency, cost, and quality dashboards.
Instrument, query, and triage voice agent LLM app traces in Phoenix (Arize) with the Ruby SDK, covering latency, cost, and quality dashboards.
Golden-set regression harness for customer support chat with GPT-4.1 rubric scoring, CI integration, and budget-aware runs.
Golden-set regression harness for RAG over internal docs with Claude Opus 4.5 pairwise scoring, CI integration, and budget-aware runs.
Golden-set regression harness for a code review agent with Ragas faithfulness judge scoring, CI integration, and budget-aware runs.
Golden-set regression harness for SQL generation with Claude Sonnet 4.5 rubric scoring, CI integration, and budget-aware runs.
Golden-set regression harness for medical Q&A with Arena-Hard-Auto scoring, CI integration, and budget-aware runs.
Golden-set regression harness for legal analysis with G-Eval scoring using Gemini 2.5 Pro, CI integration, and budget-aware runs.
Golden-set regression harness for a tool-use agent with Arena-Hard-Auto scoring, CI integration, and budget-aware runs.
Design A/B rollout analysis and drift detection for jailbreak resistance on a production LLM app in agent-based workflows.
Design A/B rollout analysis and drift detection for jailbreak resistance on a production LLM app in a code assistant.
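
For the A/B rollout items, one possible starting point is a two-proportion z-test on jailbreak success rates between the control and treatment arms. The sketch below is illustrative only: the function name `two_proportion_z_test` and its parameters are hypothetical, it assumes each red-team attempt is an independent pass/fail trial, and it covers only the A/B comparison, not drift detection over time.

```python
import math


def two_proportion_z_test(breaks_a, n_a, breaks_b, n_b):
    """Compare jailbreak success rates of arm A (control) and arm B (treatment).

    breaks_* is the number of successful jailbreaks, n_* the number of attempts.
    Returns (z, p_value) where p_value is two-sided under a normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one attempt")

    rate_a = breaks_a / n_a
    rate_b = breaks_b / n_b

    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (breaks_a + breaks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    # Degenerate case: no variation at all (e.g. zero breaks in both arms).
    if se == 0:
        return 0.0, 1.0

    z = (rate_b - rate_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A positive `z` means the treatment arm was jailbroken more often than control. The normal approximation is weak when break counts are very small, so an exact test may be a better fit for rare events.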