Agent System-Prompt Boundary Enforcement

Security · updated Thu Feb 26

Maintain the integrity of core instructions against 'Instruction Drift' or user-led prompt hijacking.

Steps

Use unique, non-guessable XML/Markdown delimiters for system boundaries.
Implement 'Instruction Pinning' by repeating core constraints in the message suffix.
Perform a 'Self-Identity Check' every 5 turns to detect role-play drift.
Monitor agent output for 'Instruction Leaking' (echoing system prompt content).
Reset the system context or 'Prune' the conversation if the boundary is breached.
Use a 'Two-Pass' architecture: Plan in one context, Execute in another.