Prompt Leakage: Defending Instruction Integrity

Security · updated Mon Feb 23

Hardening agents against 'Ignore previous instructions' and 'Show me your prompt' attacks.

Steps

Add a 'Final Guardrail' turn that filters for instruction-revealing phrases.
Use 'Instruction Encapsulation' (tags) to define the boundaries of the prompt.
Implement a 'Negative Constraint' specifically against sharing system logic.
Run automated Red-Teaming scripts to test for leakage vulnerabilities.
Use a 'Two-Stage' architecture: One agent to reason, one to format the output.

view raw JSON →