ChatGPT Prompt for Computer Use & Browser Agents
Reproducible eval sandbox for testing Computer Use / browser agents on the "research companies on Crunchbase" task in a customer support context. Fixture sites, gold trajectories, and regression gates.
You can't ship Computer Use agents without evals, yet most teams ship on vibes. Build a reproducible eval harness for browser/computer agents tackling research companies on Crunchbase in the customer support domain. Runtime: Python 3.12 + poetry.

## Part 1 — Why this is hard
Production sites change daily. Evaluating an agent against live Amazon / LinkedIn / Salesforce is:
- Non-reproducible (the site changes between runs)
- Risky (real actions have real consequences)
- Expensive (real accounts, real rate limits)

The answer: fixture sites + gold trajectories + careful live smoke tests.

## Part 2 — Fixture sites
Host local clones of the target UIs:
- Forked open-source clones where they exist
- Recorded HAR replays served via a proxy (for read-only workflows)
- Hand-built minimal reproductions of the key flows
- Fixed state (so `book flights` always has the same flights, `apply to jobs` always has the same postings)

For research companies on Crunchbase in customer support, list 3 candidate fixture strategies. Pick one, justify it, and scaffold it.

## Part 3 — Gold trajectories
For research companies on Crunchbase, author 15 gold trajectories:
- 5 straightforward (happy path)
- 5 edge cases (empty results, multi-page pagination, form validation errors, slow loads, cookie banners)
- 5 adversarial (wrong login, rate-limited, session expired mid-flow, unexpected modal)

Each trajectory is a (parameterized input → expected final state) pair, plus an optional expected action sequence for strict graders.

## Part 4 — Graders
Three grading modes:
1. **State-based**: the final state (URL + DOM snapshot + extracted data) matches the expected state
2. **Trajectory match**: did the agent take roughly the right actions? (edit distance on the action sequence)
3. **LLM-as-judge**: for fuzzy outcomes, show another model the final screenshot plus the goal and ask whether the task is done

Which grader fits each trajectory? Write the routing rule.
## Part 5 — Regression harness
Every agent PR runs the eval:
- Full suite on fixture sites
- Smoke suite (3 of the 15 trajectories) on live sites, using a sandboxed account
- Report: per-trajectory pass/fail, delta vs. main, cost delta, latency delta

CI gates:
- No trajectory that passed on main may fail on the PR
- Cost per task may not regress by more than 15%

## Part 6 — Safety for live smoke
When running against real sites:
- Dedicated test account, never real user data
- Network egress allowlist
- Financial ceiling (e.g. if research companies on Crunchbase touches money, hard-block transactions in test mode)
- Every run's artifacts retained for audit

## Part 7 — Failure analysis
When a trajectory fails, make debugging fast:
- Side-by-side: agent's final state vs. expected
- Action-level diff against gold
- Screenshot at each step
- Cost / step breakdown

## Part 8 — Customer support specifics
For customer support, what regulatory / compliance constraints apply to the eval setup? (e.g. healthcare → no real PHI; fintech → sandbox accounts only; legal → no real client data)

## Part 9 — Reporting
Per-release scorecard:
- Suite pass rate
- Cost per task
- Latency p50/p95
- New failures
- Flaky tests (re-run policy)

Publish it to the team's dashboard.

## Part 10 — Implementation
Deliver:
- Fixture site setup (Docker Compose)
- Gold trajectory files + schema
- Grader modules (state / trajectory / LLM-judge)
- Eval runner (parallel, with cost/latency capture)
- CI integration
- Dashboard scaffolding

Produce runnable code. Anyone on the team should be able to clone the repo, run `poetry run eval`, and get a green or red build.
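The two CI gates in Part 5 can be sketched as a small check function. This is a sketch under assumptions: `check_ci_gates` and the per-trajectory result shape (`{"passed": bool, "cost_usd": float}`) are hypothetical names, not part of the prompt's required deliverable.

```python
def check_ci_gates(
    main_results: dict[str, dict],
    pr_results: dict[str, dict],
    max_cost_regression: float = 0.15,
) -> list[str]:
    """Return a list of gate violations (empty list means the build is green).

    Gate 1: no trajectory that passed on main may fail on the PR.
    Gate 2: total cost per task may not regress by more than 15%.
    Each results dict maps trajectory id -> {"passed": bool, "cost_usd": float}.
    """
    failures: list[str] = []

    # Gate 1: pass/fail regressions relative to main.
    for tid, main_r in main_results.items():
        pr_r = pr_results.get(tid)
        if pr_r is not None and main_r["passed"] and not pr_r["passed"]:
            failures.append(f"{tid}: passed on main, fails on PR")

    # Gate 2: aggregate cost regression.
    main_cost = sum(r["cost_usd"] for r in main_results.values())
    pr_cost = sum(r["cost_usd"] for r in pr_results.values())
    if main_cost > 0:
        regression = (pr_cost - main_cost) / main_cost
        if regression > max_cost_regression:
            failures.append(
                f"cost regressed {regression:.0%} (limit {max_cost_regression:.0%})"
            )
    return failures
```

In CI, a non-empty return value would fail the build and the list itself becomes the per-PR report's "gate violations" section.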
More prompts for Computer Use & Browser Agents.
End-to-end Computer Use agent that can fill job applications on company portals autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
End-to-end Computer Use agent that can manage ads in Meta Ads Manager autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
End-to-end Computer Use agent that can download reports from Stripe dashboard autonomously. Screenshot loop, action grounding, safety gates, and recovery from unexpected UI states.
Reproducible eval sandbox for testing Computer Use / browser agents on schedule posts in Buffer in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on triage tickets in Zendesk in education context. Fixture sites, gold trajectories, and regression gates.