Category Not Found

3559 prompts

Sandboxed Computer Use Eval Harness for gaming: fill job applications on company portals

Reproducible eval sandbox for testing Computer Use / browser agents on filling out job applications on company portals in a gaming context. Includes fixture sites, gold trajectories, and regression gates.

Sandboxed Computer Use Eval Harness for gaming: download reports from Stripe dashboard

Reproducible eval sandbox for testing Computer Use / browser agents on downloading reports from the Stripe dashboard in a gaming context. Includes fixture sites, gold trajectories, and regression gates.

Sandboxed Computer Use Eval Harness for gaming: book flights on Google Flights

Reproducible eval sandbox for testing Computer Use / browser agents on booking flights on Google Flights in a gaming context. Includes fixture sites, gold trajectories, and regression gates.

Sandboxed Computer Use Eval Harness for gaming: manage ads in Meta Ads Manager

Reproducible eval sandbox for testing Computer Use / browser agents on managing ads in Meta Ads Manager in a gaming context. Includes fixture sites, gold trajectories, and regression gates.

Sandboxed Computer Use Eval Harness for gaming: update records in Salesforce

Reproducible eval sandbox for testing Computer Use / browser agents on updating records in Salesforce in a gaming context. Includes fixture sites, gold trajectories, and regression gates.

Migrate daily research briefing Agent from Single-Loop to hierarchical multi-agent in Mastra

Refactor an existing single-loop tool-calling agent for daily research briefing into a hierarchical multi-agent architecture using Mastra. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to swarm/handoff in LangGraph

Refactor an existing single-loop tool-calling agent for daily research briefing into a swarm/handoff architecture using LangGraph. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to orchestrator-worker in Haystack agents

Refactor an existing single-loop tool-calling agent for daily research briefing into an orchestrator-worker architecture using Haystack agents. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to router in Pydantic AI

Refactor an existing single-loop tool-calling agent for daily research briefing into a router architecture using Pydantic AI. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to Deep-Research pipeline in Semantic Kernel

Refactor an existing single-loop tool-calling agent for daily research briefing into a Deep-Research pipeline architecture using Semantic Kernel. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to loop-until-done with critic in Vercel AI SDK

Refactor an existing single-loop tool-calling agent for daily research briefing into a loop-until-done-with-critic architecture using Vercel AI SDK. Focus: what to split, what to keep, what to evaluate.

Migrate daily research briefing Agent from Single-Loop to ReAct (Reason+Act) in AutoGen

Refactor an existing single-loop tool-calling agent for daily research briefing into a ReAct (Reason+Act) architecture using AutoGen. Focus: what to split, what to keep, what to evaluate.
