Self-critique layer enforcing maintain confidentiality of system prompt for a travel concierge system on Claude Haiku 4, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a travel concierge system on o1, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a travel concierge system on Claude Opus 4.5, with bypass defenses.
Self-critique layer enforcing no medical diagnosis for a travel concierge system on o3, with bypass defenses.
Self-critique layer enforcing no biometric identification for a travel concierge system on Gemini 2.5 Pro, with bypass defenses.
Self-critique layer enforcing stay on topic for a travel concierge system on Grok 3, with bypass defenses.
Self-critique layer enforcing block credential leakage for a travel concierge system on DeepSeek-V3, with bypass defenses.
Self-critique layer enforcing no election manipulation for a travel concierge system on GPT-4o, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a travel concierge system on Llama 3.3 70B, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a travel concierge system on GPT-4o-mini, with bypass defenses.
Self-critique layer enforcing refuse hate speech for a threat-intel summarizer system on GPT-4o, with bypass defenses.
Self-critique layer enforcing no self-harm content for a threat-intel summarizer system on GPT-4o-mini, with bypass defenses.