Agent System-Prompt Boundary Enforcement

Security · updated Thu Feb 26

Keep an agent's core system instructions intact against 'Instruction Drift' over long conversations and against user-led prompt hijacking.

Steps

  1. Use unique, non-guessable XML/Markdown delimiters for system boundaries.
  2. Implement 'Instruction Pinning' by repeating the core constraints in a suffix appended to each message.
  3. Perform a 'Self-Identity Check' every 5 turns to detect role-play drift.
  4. Monitor agent output for 'Instruction Leaking' (the agent echoing system prompt content back to the user).
  5. Reset the system context or 'Prune' the conversation if the boundary is breached.
  6. Use a 'Two-Pass' architecture: Plan in one context, Execute in another.
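
Steps 1 to 4 can be sketched in a few small helpers. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (make_boundary, wrap_system, pin, leaks_system_prompt, needs_identity_check), the "sys-" tag prefix, and the 8-word overlap window are all hypothetical choices made for the example, and it uses only the Python standard library.

```python
import secrets


def make_boundary() -> str:
    """Step 1: return a random, non-guessable tag name for one session."""
    # 8 random bytes -> 16 hex characters; the "sys-" prefix is an arbitrary choice.
    return f"sys-{secrets.token_hex(8)}"


def wrap_system(prompt: str, tag: str) -> str:
    """Step 1: enclose the system prompt in the session's boundary tag."""
    return f"<{tag}>\n{prompt}\n</{tag}>"


def pin(user_message: str, constraints: list[str], tag: str) -> str:
    """Step 2: append the core constraints as a suffix to the user message."""
    # Remove any copy of the boundary tag from user text so it cannot be forged.
    cleaned = user_message.replace(f"<{tag}>", "").replace(f"</{tag}>", "")
    reminder = "\n".join(f"- {c}" for c in constraints)
    return f"{cleaned}\n\n<{tag}>\nCore constraints:\n{reminder}\n</{tag}>"


def needs_identity_check(turn: int, every: int = 5) -> bool:
    """Step 3: True on every `every`-th turn (turns counted from 1)."""
    return turn > 0 and turn % every == 0


def leaks_system_prompt(output: str, system_prompt: str, window: int = 8) -> bool:
    """Step 4: True if the output repeats `window` consecutive prompt words.

    Assumption: a verbatim word-run overlap is a good enough leak signal.
    Prompts shorter than `window` words are never flagged by this check.
    """
    sys_words = system_prompt.lower().split()
    # Pad with spaces so matches are aligned to whole words.
    out_text = " " + " ".join(output.lower().split()) + " "
    for i in range(len(sys_words) - window + 1):
        run = " ".join(sys_words[i:i + window])
        if f" {run} " in out_text:
            return True
    return False
```

In use, the tag would be generated once per session, the system prompt wrapped with it, every user message passed through pin before being sent, and every model reply passed through leaks_system_prompt; a positive result is the signal for the reset or prune in step 5. Paraphrased leaks will not be caught by a verbatim overlap check, so this is a first line of detection rather than a complete one.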
