Refactor a baseline resume screening prompt into a Faithful Chain-of-Thought version and compare quality on o3.
Refactor a baseline math word problem prompt into a Faithful Chain-of-Thought version and compare quality on DeepSeek-R1.
Refactor a baseline multi-hop QA prompt into a Faithful Chain-of-Thought version and compare quality on Claude 3.7 Sonnet.
Refactor a baseline medical triage prompt into a Faithful Chain-of-Thought version and compare quality on o3-mini.
Refactor a baseline research synthesis prompt into a Faithful Chain-of-Thought version and compare quality on Llama 3.3 70B.
Refactor a baseline SQL query writing prompt into a Faithful Chain-of-Thought version and compare quality on Qwen 2.5 72B.
Refactor a baseline API design decision prompt into a Faithful Chain-of-Thought version and compare quality on GPT-4o.
Refactor a baseline A/B test interpretation prompt into a Faithful Chain-of-Thought version and compare quality on Claude Sonnet 4.5.
Refactor a baseline contract review prompt into a Faithful Chain-of-Thought version and compare quality on GPT-4.1.
Refactor a baseline log anomaly detection prompt into a Faithful Chain-of-Thought version and compare quality on Claude Haiku 4.
Refactor a baseline sales lead qualification prompt into a Faithful Chain-of-Thought version and compare quality on Llama 3.1 405B.
Refactor a baseline customer support routing prompt into a Faithful Chain-of-Thought version and compare quality on o3-mini.
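
A minimal sketch of how one of these tasks might be set up, using the math word problem case. It assumes "Faithful Chain-of-Thought" means the two-stage setup in which the model first translates the question into an executable reasoning chain and a deterministic interpreter then computes the answer, so the final answer follows from the stated steps. The prompt wording, `stub_model`, the canned outputs, and the one-item dataset are all hypothetical placeholders; a real comparison would swap in an actual model client and a labeled evaluation set, and the stub's numbers say nothing about how any real model performs.

```python
from typing import Callable

# Hypothetical prompt templates; the wording is illustrative only.
BASELINE_PROMPT = "Answer with a number only.\nQuestion: {question}\nAnswer:"
FAITHFUL_COT_PROMPT = (
    "Translate the question into Python statements, one reasoning step per line, "
    "each preceded by a comment quoting the supporting text. "
    "Assign the final result to a variable named `answer`. Output only code.\n"
    "Question: {question}\n"
)


def run_reasoning_chain(code: str) -> object:
    """Stage 2: execute the generated chain deterministically and read `answer`."""
    namespace: dict = {}
    # Builtins are emptied to limit what the chain can call; this is NOT a real sandbox.
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["answer"]


def baseline_predict(call_model: Callable[[str], str], question: str) -> int:
    """Baseline: the model states the answer directly."""
    return int(call_model(BASELINE_PROMPT.format(question=question)).strip())


def faithful_predict(call_model: Callable[[str], str], question: str) -> object:
    """Faithful CoT: the model writes the chain, the interpreter produces the answer."""
    chain = call_model(FAITHFUL_COT_PROMPT.format(question=question))
    return run_reasoning_chain(chain)


def accuracy(predict: Callable[[str], object], dataset: list) -> float:
    """Exact-match accuracy over (question, gold_answer) pairs."""
    correct = sum(1 for question, gold in dataset if predict(question) == gold)
    return correct / len(dataset)


# Placeholder chain standing in for what a model might return for the question below.
CANNED_CHAIN = (
    '# "A crate holds 12 bottles"\n'
    "per_crate = 12\n"
    '# "7 crates"\n'
    "crates = 7\n"
    '# "5 bottles break"\n'
    "broken = 5\n"
    "answer = per_crate * crates - broken\n"
)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text so the sketch runs offline."""
    if prompt.startswith("Translate"):
        return CANNED_CHAIN
    return "84"  # placeholder direct answer, deliberately not the gold value


DATASET = [
    ("A crate holds 12 bottles. 7 crates arrive and 5 bottles break. How many are intact?", 79),
]

if __name__ == "__main__":
    print("baseline:", accuracy(lambda q: baseline_predict(stub_model, q), DATASET))
    print("faithful CoT:", accuracy(lambda q: faithful_predict(stub_model, q), DATASET))
```

Exact-match accuracy suits tasks with a single checkable answer (math, SQL results, routing labels); tasks such as research synthesis or contract review would likely need a rubric or human grading in place of `accuracy`.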