Prompt Leakage: Defending Instruction Integrity
Hardening agents against 'Ignore previous instructions' and 'Show me your prompt' attacks.
Steps
- Add a 'Final Guardrail' turn that filters for instruction-revealing phrases.
- Use 'Instruction Encapsulation' (tags) to define the boundaries of the prompt.
- Implement a 'Negative Constraint' specifically against sharing system logic.
- Run automated Red-Teaming scripts to test for leakage vulnerabilities.
- Use a 'Two-Stage' architecture: One agent to reason, one to format the output.