ChatGPT Prompt for Computer Use & Browser Agents
Reproducible eval sandbox for testing Computer Use / browser agents on the "research companies on Crunchbase" task in a customer support context. Fixture sites, gold trajectories, and regression gates.
You can't ship Computer Use agents without evals, yet most teams ship on vibes. Build a reproducible eval harness for browser/computer agents tackling research companies on Crunchbase in the customer support domain. Runtime: Python 3.12 + poetry.

## Part 1 — Why this is hard
Production sites change daily. Evaluating an agent against live Amazon / LinkedIn / Salesforce is:
- Non-reproducible (the site changes between runs)
- Risky (real actions have real consequences)
- Expensive (real accounts, real rate limits)

The answer: fixture sites + gold trajectories + careful live smoke tests.

## Part 2 — Fixture sites
Host local clones of the target UIs:
- Forked open-source clones where they exist
- Recorded HAR replays served via a proxy (for read-only workflows)
- Hand-built minimal reproductions of the key flows
- Fixed state (so `book flights` always has the same flights, `apply to jobs` always has the same postings)

For research companies on Crunchbase in customer support, list 3 candidate fixture strategies. Pick one, justify it, and scaffold it.

## Part 3 — Gold trajectories
For research companies on Crunchbase, author 15 gold trajectories:
- 5 straightforward (happy path)
- 5 edge cases (empty results, multi-page pagination, form validation errors, slow loads, cookie banners)
- 5 adversarial (wrong login, rate-limited, session expired mid-flow, unexpected modal)

Each trajectory is a (parameterized input → expected final state) pair, plus an optional expected action sequence for strict graders.

## Part 4 — Graders
Three grading modes:
1. **State-based**: the final state (URL + DOM snapshot + extracted data) matches the expected state
2. **Trajectory match**: did the agent take roughly the right actions? (edit distance on the action sequence)
3. **LLM-as-judge**: for fuzzy outcomes, show another model the final screenshot plus the goal and ask whether the task is done

Which grader fits each trajectory? Write the routing rule.
## Part 5 — Regression harness
Every agent PR runs the eval:
- Full suite on fixture sites
- Smoke suite (3 of the 15 trajectories) on live sites, using a sandboxed account
- Report: per-trajectory pass/fail, delta vs. main, cost delta, latency delta

CI gates:
- No trajectory that passed on main may fail on the PR
- Cost per task may not regress by more than 15%

## Part 6 — Safety for live smoke
When running against real sites:
- Dedicated test account, never real user data
- Network egress allowlist
- Financial ceiling (e.g. if research companies on Crunchbase touches money, hard-block transactions in test mode)
- Every run's artifacts retained for audit

## Part 7 — Failure analysis
When a trajectory fails, make debugging fast:
- Side-by-side: agent's final state vs. expected
- Action-level diff against gold
- Screenshot at each step
- Cost / step breakdown

## Part 8 — Customer support specifics
For customer support, what regulatory / compliance constraints apply to the eval setup? (e.g. healthcare → no real PHI; fintech → sandbox accounts only; legal → no real client data)

## Part 9 — Reporting
Per-release scorecard:
- Suite pass rate
- Cost per task
- Latency p50/p95
- New failures
- Flaky tests (re-run policy)

Publish it to the team's dashboard.

## Part 10 — Implementation
Deliver:
- Fixture site setup (Docker Compose)
- Gold trajectory files + schema
- Grader modules (state / trajectory / LLM-judge)
- Eval runner (parallel, with cost/latency capture)
- CI integration
- Dashboard scaffolding

Produce runnable code. Anyone on the team should be able to clone the repo, run `poetry run eval`, and get a green or red build.
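The two CI gates in Part 5 can be sketched as a small check function. This is a sketch under assumptions: `check_ci_gates` and the per-trajectory result shape (`{"passed": bool, "cost_usd": float}`) are hypothetical names, not part of the prompt's required deliverable.

```python
def check_ci_gates(
    main_results: dict[str, dict],
    pr_results: dict[str, dict],
    max_cost_regression: float = 0.15,
) -> list[str]:
    """Return a list of gate violations (empty list means the build is green).

    Gate 1: no trajectory that passed on main may fail on the PR.
    Gate 2: total cost per task may not regress by more than 15%.
    Each results dict maps trajectory id -> {"passed": bool, "cost_usd": float}.
    """
    failures: list[str] = []

    # Gate 1: pass/fail regressions relative to main.
    for tid, main_r in main_results.items():
        pr_r = pr_results.get(tid)
        if pr_r is not None and main_r["passed"] and not pr_r["passed"]:
            failures.append(f"{tid}: passed on main, fails on PR")

    # Gate 2: aggregate cost regression.
    main_cost = sum(r["cost_usd"] for r in main_results.values())
    pr_cost = sum(r["cost_usd"] for r in pr_results.values())
    if main_cost > 0:
        regression = (pr_cost - main_cost) / main_cost
        if regression > max_cost_regression:
            failures.append(
                f"cost regressed {regression:.0%} (limit {max_cost_regression:.0%})"
            )
    return failures
```

In CI, a non-empty return value would fail the build and the list itself becomes the per-PR report's "gate violations" section.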
More prompts for Computer Use & Browser Agents.
End-to-end Computer Use agent that can fill job applications on company portals autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
End-to-end Computer Use agent that can manage ads in Meta Ads Manager autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
End-to-end Computer Use agent that can download reports from Stripe dashboard autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
Reproducible eval sandbox for testing Computer Use / browser agents on schedule posts in Buffer in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on triage tickets in Zendesk in education context. Fixture sites, gold trajectories, and regression gates.