Refactor a baseline technical spec writing prompt into a Program-of-Thoughts version and compare quality on Claude 3.7 Sonnet.
Refactor a baseline math word problem prompt into a Program-of-Thoughts version and compare quality on DeepSeek-V3.
Refactor a baseline legal brief summarization prompt into a Program-of-Thoughts version and compare quality on Qwen 2.5 72B.
Refactor a baseline bug root-cause analysis prompt into a Chain-of-Verification version and compare quality on Gemini 2.0 Flash.
Refactor a baseline incident post-mortem prompt into a Chain-of-Verification version and compare quality on GPT-4o-mini.
Refactor a baseline technical spec writing prompt into a Chain-of-Verification version and compare quality on o1-mini.
Refactor a baseline schema migration planning prompt into a Chain-of-Verification version and compare quality on DeepSeek-V3.
Refactor a baseline funnel analysis prompt into a Chain-of-Verification version and compare quality on Claude 3.7 Sonnet.
Refactor a baseline resume screening prompt into a Chain-of-Verification version and compare quality on o3.
Refactor a baseline math word problem prompt into a Chain-of-Verification version and compare quality on Llama 3.3 70B.
Refactor a baseline multi-hop QA prompt into a Chain-of-Verification version and compare quality on Claude 4 Sonnet.
Refactor a baseline medical triage prompt into a Chain-of-Verification version and compare quality on o3-mini.
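
The two refactorings named in the tasks above can be sketched as simple prompt wrappers. This is a minimal illustration, not a prescribed method: the function names `to_program_of_thoughts` and `to_chain_of_verification`, the instruction wording, and the `{draft}`, `{questions}`, and `{answers}` placeholders are all assumptions made for the example. It only builds prompt text; calling a model and scoring quality are left out.

```python
def to_program_of_thoughts(baseline_prompt: str) -> str:
    """Wrap a baseline prompt so the model answers by writing a program.

    Hypothetical helper: the instruction wording is an assumption.
    """
    return (
        f"{baseline_prompt.strip()}\n\n"
        "Do not compute the answer in prose. Write a Python program that "
        "computes it and stores the final result in a variable named `answer`."
    )


def to_chain_of_verification(baseline_prompt: str) -> list[str]:
    """Return four stage prompts: draft, plan checks, run checks, revise.

    Hypothetical helper: the {draft}, {questions}, and {answers} placeholders
    are left as literal text to be filled in by the caller between model calls.
    """
    task = baseline_prompt.strip()
    return [
        # Stage 1: produce a baseline draft.
        f"{task}\n\nWrite a draft response.",
        # Stage 2: plan verification questions about the draft.
        "List verification questions that check the claims in this draft:\n\n{draft}",
        # Stage 3: answer the questions without seeing the draft.
        "Answer each question independently of any earlier draft:\n\n{questions}",
        # Stage 4: revise the draft in light of the verification answers.
        "Revise the draft using the verification answers.\n\n"
        "Draft:\n{draft}\n\nAnswers:\n{answers}",
    ]
```

For example, `to_chain_of_verification("Summarize the incident.")` returns a list of four stage prompts, the first of which begins with the baseline task text. Keeping the baseline prompt unchanged inside each wrapper makes it easier to attribute any quality difference to the added technique rather than to rewording.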