Agent System-Prompt Boundary Enforcement
Maintain the integrity of core instructions against 'Instruction Drift' or user-led prompt hijacking.
Steps
- Use unique, non-guessable XML/Markdown delimiters for system boundaries.
- Implement 'Instruction Pinning' by repeating core constraints in the message suffix.
- Perform a 'Self-Identity Check' every 5 turns to detect role-play drift.
- Monitor agent output for 'Instruction Leaking' (echoing system prompt content).
- Reset the system context or 'Prune' the conversation if the boundary is breached.
- Use a 'Two-Pass' architecture: Plan in one context, Execute in another.