Defensive system prompt enforcing refusal-quality grader and no medical diagnosis for data-analysis pair on Claude 4 Sonnet.
Defensive system prompt enforcing hallucination flag + retry and refuse hate speech for data-analysis pair on o3-mini.
Defensive system prompt enforcing hallucination flag + retry and no self-harm content for data-analysis pair on Llama 3.1 405B.
Defensive system prompt enforcing hallucination flag + retry and no medical diagnosis for data-analysis pair on Claude 4.5 Sonnet.
Defensive system prompt enforcing human-in-the-loop escalation and no self-harm content for data-analysis pair on Mistral Large.
Defensive system prompt enforcing input classifier and no medical diagnosis for data-analysis pair on Claude Opus 4.5.
Defensive system prompt enforcing input classifier and refuse hate speech for data-analysis pair on GPT-4o.
Defensive system prompt enforcing input classifier and no self-harm content for writing editor on Mistral Large.
Defensive system prompt enforcing per-turn policy check and no medical diagnosis for writing editor on Claude Opus 4.5.
Defensive system prompt enforcing per-turn policy check and no self-harm content for writing editor on Mistral Small 3.
Defensive system prompt enforcing tool-authorization gate and no medical diagnosis for writing editor on Claude Haiku 4.
Defensive system prompt enforcing tool-authorization gate and refuse hate speech for writing editor on GPT-4.1.