Reproducible eval sandbox for testing Computer Use / browser agents on booking flights on Google Flights in a developer tooling context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on managing ads in Meta Ads Manager in a developer tooling context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on updating records in Salesforce in a developer tooling context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on managing ads in Meta Ads Manager in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on downloading reports from the Stripe dashboard in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on booking flights on Google Flights in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on scraping product listings on Amazon in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on filling expense reports in Concur in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on monitoring competitor pricing in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on pulling analytics from the GA4 dashboard in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on reconciling invoices in QuickBooks in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
Reproducible eval sandbox for testing Computer Use / browser agents on extracting leads from Apollo.io in a customer support context. Includes fixture sites, gold trajectories, and regression gates.
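
The descriptions above share the same three components: fixture sites, gold trajectories, and regression gates. As a rough illustration of how a gold trajectory and a regression gate might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not something the sandboxes are known to implement: the `Step` type, the selector strings, the in-order matching score (longest common subsequence over the gold steps), and the `baseline` and `tolerance` parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent action against a fixture site (hypothetical schema)."""
    action: str       # e.g. "click" or "type"
    target: str       # hypothetical selector on the fixture page
    value: str = ""   # text typed, if any


def lcs_length(a, b):
    """Length of the longest common subsequence of two step lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def match_score(gold, actual):
    """Fraction of gold steps the agent reproduced, in order."""
    if not gold:
        return 1.0
    return lcs_length(gold, actual) / len(gold)


def regression_gate(scores, baseline, tolerance=0.0):
    """Pass when the mean score has not dropped below baseline - tolerance."""
    mean = sum(scores) / len(scores)
    return mean >= baseline - tolerance


# Assumed example data for a flight-booking fixture.
gold = [
    Step("click", "#from"),
    Step("type", "#from", "SFO"),
    Step("click", "#search"),
]
# The agent made one extra click but still covered every gold step in order.
run_ok = [gold[0], Step("click", "#banner"), gold[1], gold[2]]
# The agent stopped before submitting the search.
run_short = gold[:2]

scores = [match_score(gold, run_ok), match_score(gold, run_short)]
gate_passed = regression_gate(scores, baseline=0.9, tolerance=0.1)
```

Scoring by in-order subsequence tolerates extra exploratory actions while still penalizing skipped steps; a stricter gate could instead require an exact match or check the final page state of the fixture site.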