Agent Healthcheck & Liveness Probes

Reliability · updated Fri Feb 27

Defining operational 'Vital Signs' to detect and restart hung or drifting agents.

Steps

  1. Implement a `/health` endpoint that verifies the LLM API answers a heartbeat request and that memory is available.
  2. Set a `Liveness Probe` to detect 'Infinite Loop' hangs (no response for >60s).
  3. Configure a `Readiness Probe` to ensure vector databases are indexed and reachable.
  4. Monitor 'Token Velocity': restart the agent if it emits more than 10k tokens in under 10 seconds (a sign of a runaway loop).
  5. Define a 'Graceful Shutdown' period so that agents can finalize in-flight tool commits before they are terminated.
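
The checks in steps 1 and 4 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `TokenVelocityMonitor`, `health_report`, and the `llm_ok` / `memory_ok` callables are hypothetical names introduced here, and the 10k-token / 10-second defaults are taken from step 4. How the result is exposed on `/health` depends on the web framework in use and is left out.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class TokenVelocityMonitor:
    """Sliding-window token counter (hypothetical helper for step 4)."""

    def __init__(
        self,
        max_tokens: int = 10_000,
        window_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_tokens = max_tokens
        self.window_s = window_s
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._events: Deque[Tuple[float, int]] = deque()
        self._total = 0

    def record(self, tokens: int) -> None:
        """Register tokens emitted by the agent at the current time."""
        now = self._clock()
        self._events.append((now, tokens))
        self._total += tokens
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop events that are window_s seconds old or older.
        while self._events and now - self._events[0][0] >= self.window_s:
            _, count = self._events.popleft()
            self._total -= count

    def should_restart(self) -> bool:
        """True when the tokens inside the window exceed the limit."""
        self._evict(self._clock())
        return self._total > self.max_tokens


def health_report(
    llm_ok: Callable[[], bool],
    memory_ok: Callable[[], bool],
    monitor: TokenVelocityMonitor,
) -> Dict[str, object]:
    """Combine the individual checks into one payload for a /health handler.

    llm_ok and memory_ok are assumed to be supplied by the caller, e.g. a
    cheap ping to the LLM API and a check on the agent's memory store.
    """
    checks = {
        "llm_api": llm_ok(),
        "memory": memory_ok(),
        "token_velocity": not monitor.should_restart(),
    }
    return {"healthy": all(checks.values()), "checks": checks}
```

A `/health` handler could return HTTP 200 when `healthy` is true and 503 otherwise, so that a liveness probe restarts the agent on repeated failures. Calling `monitor.record(n)` after each LLM response keeps the window up to date.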
