Token Stealing: Direct Model Weight Probing
Hardening against extraction of internal system instructions.
Steps
- Enforce 'Output Randomization' (for example, a nonzero sampling temperature) and do not return per-token log-probabilities, to hinder log-prob analysis.
- Limit agent responses to specific JSON or Markdown schemas.
- Monitor for 'Low-Temperature' probing, i.e. repeated requests at or near temperature 0 that target sensitive system keys.
- Block queries that ask the agent to 'simulate a terminal' or 'debug weights'.
- Implement a token-rate limit for high-precision output turns.
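
A minimal sketch of three of the steps above (query blocking, schema-limited responses, and a token-rate limit), assuming a Python gateway sits in front of the agent. Every name here (`BLOCKED_PATTERNS`, `ALLOWED_KEYS`, `is_blocked`, `conforms_to_schema`, `TokenRateLimiter`) is hypothetical, and the patterns, schema, and limits are illustrative placeholders rather than recommended values.

```python
import json
import re
import time
from collections import deque
from typing import Optional

# Hypothetical blocklist; simple patterns like these are easy to paraphrase
# around, so treat them as one signal among several.
BLOCKED_PATTERNS = [
    re.compile(r"simulate\s+a\s+terminal", re.IGNORECASE),
    re.compile(r"debug\s+(the\s+)?weights", re.IGNORECASE),
]

# Hypothetical response schema: exactly these top-level keys are allowed.
ALLOWED_KEYS = {"answer", "sources"}


def is_blocked(query: str) -> bool:
    """Return True if the query matches any blocked pattern."""
    return any(p.search(query) for p in BLOCKED_PATTERNS)


def conforms_to_schema(raw: str) -> bool:
    """Return True only if raw is a JSON object with exactly the allowed keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and set(obj) == ALLOWED_KEYS
        and isinstance(obj["answer"], str)
        and isinstance(obj["sources"], list)
    )


class TokenRateLimiter:
    """Sliding-window limit on the number of output tokens per client."""

    def __init__(self, max_tokens: int, window_seconds: float) -> None:
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._used = 0          # tokens spent inside the current window

    def allow(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend and return True if it fits in the window."""
        now = time.monotonic() if now is None else now
        # Drop spends that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, old_tokens = self._events.popleft()
            self._used -= old_tokens
        if self._used + tokens > self.max_tokens:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True
```

In this sketch, a request would be rejected when `is_blocked(query)` is true, a response would be dropped when `conforms_to_schema(response)` is false, and `TokenRateLimiter.allow` would be consulted once per turn with the number of tokens about to be emitted.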