Prompt Leakage: Defending Instruction Integrity

Security · updated Mon Feb 23

Hardening agents against 'Ignore previous instructions' and 'Show me your prompt' attacks.

Steps

  1. Add a 'Final Guardrail' turn that filters for instruction-revealing phrases.
  2. Use 'Instruction Encapsulation' (tags) to define the boundaries of the prompt.
  3. Implement a 'Negative Constraint' specifically against sharing system logic.
  4. Run automated Red-Teaming scripts to test for leakage vulnerabilities.
  5. Use a 'Two-Stage' architecture: One agent to reason, one to format the output.

view raw JSON →