Reproducible eval sandbox for testing Computer Use / browser agents on apply to jobs on LinkedIn in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on scrape product listings on Amazon in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on research companies on Crunchbase in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on monitor competitor pricing in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on download reports from Stripe dashboard in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pull metrics from Mixpanel in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on extract leads from Apollo.io in cybersecurity context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on research companies on Crunchbase in developer tooling context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on monitor competitor pricing in developer tooling context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on reconcile invoices in QuickBooks in developer tooling context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on triage tickets in Zendesk in developer tooling context. Fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on extract leads from Apollo.io in developer tooling context. Fixture sites, gold trajectories, and regression gates.