The Heartbeat Pattern: An AI Agent That Checks Its Own Systems
Most AI agent setups have a blind spot. The agent can write code, run tests, generate content — but when something stalls, nobody notices until a human happens to look.
I have a TTS pipeline that generates audiobook chapters. It runs for hours. Sometimes a worker process hangs. Sometimes VRAM fills up and the next inference call panics. Sometimes a model file gets corrupted and every subsequent generation produces garbage audio that sounds like static run through a washing machine.
If I am watching the terminal, I catch it. If I am asleep or at the grocery store, the pipeline quietly fails for six hours. The next morning I have a folder full of empty WAV files and a queue that says "processing."
The fix was not better error handling. Error handling catches exceptions. Most of these failures are not exceptions — they are hangs, stalls, and silent degradations. The code is running. It is just not producing anything useful.
The fix was a heartbeat.
What a Heartbeat Actually Is
A heartbeat is a periodic self-check. Every five minutes, the agent wakes up, looks at its own systems, and asks: is everything OK?
The heartbeat is not a cron job that runs a specific task. It is a monitoring loop that checks the state of ongoing work. Think of it as the difference between a calendar reminder ("run backups at 3 AM") and a smoke detector ("is something on fire right now?").
The heartbeat pattern has three parts: detection, decision, and action.
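Here is a minimal sketch of how those three parts can fit into one loop. Every name in it (`run_heartbeat_once`, `heartbeat_loop`, the `decide` and `act` callables) is hypothetical and chosen for illustration; it is not the code from my pipeline. The one design choice worth noting is that a check which crashes is counted as a failed check, so a broken check cannot take the heartbeat down with it.

```python
import time
from typing import Callable, Dict, Iterable, Optional

# Hypothetical convention: a check returns True when the system looks healthy.
Check = Callable[[], bool]


def run_heartbeat_once(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Detection: run every check and record pass/fail by name."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    return results


def heartbeat_loop(
    checks: Dict[str, Check],
    decide: Callable[[Dict[str, bool]], Iterable[str]],
    act: Callable[[str], None],
    interval_s: float = 300.0,           # five minutes, as in the text
    max_cycles: Optional[int] = None,    # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        results = run_heartbeat_once(checks)   # detection
        for action in decide(results):         # decision
            act(action)                        # action
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep(interval_s)
```

The `max_cycles` and `sleep` parameters exist only so the loop can be exercised in a test without waiting five minutes.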
Detection: What to Check
My heartbeat polls five things every five minutes.
Task queue health. Is the task queue moving? If the same task has been in "processing" state for more than four hours, it is probably stuck. The stale-detection threshold is the key tuning parameter — too short and you kill long-running jobs that are working fine, too long and you waste hours on dead processes.
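A stale-task check can be as small as the sketch below. The `Task` shape and field names are assumptions for illustration; a real queue will store this differently. The four-hour threshold is the one mentioned above, exposed as a parameter because it is the value you will tune.

```python
from dataclasses import dataclass
from typing import List

STALE_AFTER_S = 4 * 60 * 60  # four hours, the threshold from the text


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative.
    task_id: str
    state: str
    started_at: float  # Unix timestamp when the task entered its current state


def find_stale_tasks(
    tasks: List[Task], now: float, stale_after_s: float = STALE_AFTER_S
) -> List[Task]:
    """Return tasks that have sat in 'processing' longer than the threshold."""
    return [
        t
        for t in tasks
        if t.state == "processing" and now - t.started_at > stale_after_s
    ]
```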
Process liveness. Are the worker processes I spawned still alive? A GPU worker can segfault without the parent process noticing, especially when you are spawning subprocesses for inference. The heartbeat checks process IDs against the process table.
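On POSIX systems, one common way to check a PID against the process table is to send signal 0, which performs the existence and permission checks without delivering a signal. The sketch below assumes Linux or macOS; do not use it on Windows, where `os.kill` behaves differently. The function name is hypothetical.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

One caveat: a child that has exited but has not been waited on (a zombie) still has a process-table entry, so this returns True for it. For workers spawned with `subprocess.Popen`, calling `poll()` on the handle avoids that problem.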
Resource availability. Is there enough VRAM free? If 98% of GPU memory is in use and the job is not making progress, something is wrong. The heartbeat reads nvidia-smi output and flags low free memory.
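A sketch of the VRAM check, assuming the `nvidia-smi` query mode with CSV output (`--query-gpu=memory.used,memory.total --format=csv,noheader,nounits`), which prints one line per GPU such as `23900, 24576` in MiB. The function names and the 5% free-memory floor are illustrative assumptions. Parsing is kept separate from the subprocess call so it can be tested on a machine without a GPU.

```python
import subprocess
from typing import List, Tuple


def parse_memory_csv(output: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into (used, total) tuples, one per GPU.

    Assumes every field is numeric; some devices may report non-numeric
    values, which this sketch does not handle.
    """
    gpus = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def vram_is_low(gpus: List[Tuple[int, int]], min_free_fraction: float = 0.05) -> bool:
    """True if any GPU has less than min_free_fraction of its memory free."""
    return any((total - used) / total < min_free_fraction for used, total in gpus)


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return (used, total) MiB per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,  # the monitor itself should never hang on a wedged driver
    )
    return parse_memory_csv(result.stdout)
```

Low free memory alone is not a failure. It only becomes one when combined with stalled progress, which is a job for the decision layer.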
Output validation. Are the most recent outputs valid? For TTS, this means checking that audio files exist, are non-empty, and have a reasonable duration. A zero-byte WAV file means the generation failed silently.
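The three conditions above (exists, non-empty, reasonable duration) can be sketched with the standard-library `wave` module. The one-second minimum is an assumed placeholder, and `wave` handles only uncompressed PCM files, so this is a sketch under those assumptions. It catches empty and truncated files; it does not detect audio that has the right length but is static.

```python
import os
import wave


def wav_is_valid(path: str, min_seconds: float = 1.0) -> bool:
    """True if the file exists, is non-empty, and is at least min_seconds long."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError):
        # Missing or truncated header: treat as a failed generation.
        return False
    return rate > 0 and frames / rate >= min_seconds
```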
Catalog freshness. Has the work catalog been updated recently? If nobody has written to the catalog in the last 30 minutes during an active job, work has stalled.
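If the catalog is a file on disk (an assumption; it could just as well be a database table with an updated-at column), its modification time is enough for a freshness check. The function name is hypothetical and the 30-minute window is the one from the text.

```python
import os
import time
from typing import Optional


def catalog_is_fresh(
    path: str, max_age_s: float = 30 * 60, now: Optional[float] = None
) -> bool:
    """True if the catalog file was modified within the last max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable catalog counts as stale.
        return False
    return now - mtime <= max_age_s
```

This check only means something while a job is active, so the caller should skip it when the queue is idle.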
The checks are simple. Each one is a few lines of code. The power is in the aggregation — five trivial checks, combined, catch almost every failure mode the system has ever hit.
Decision: When to Act
Detection without decision is just logging. You get a dashboard full of red marks and nobody looks at it.
The decision layer is a set of rules:
- Stale task + dead process — mark the task as failed, return it to the queue, restart the worker.
- Stale task + live process — the process is hung. Kill it, mark the task as failed, restart.
- Low VRAM + stalled progress — flush VRAM, wait for cleanup, resume.
- Invalid output + live process — the model is producing garbage. Kill the worker, restart with a fresh model load.
- Any check fails for 3 consecutive heartbeats — alert the human.
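The first four rules depend only on the current heartbeat, so they can be sketched as a single stateless function. The `Snapshot` fields and the action names are hypothetical labels, not a real API. The fifth rule needs state across heartbeats and is deliberately left out of this function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    # Hypothetical summary of one heartbeat's detection results.
    task_stale: bool
    process_alive: bool
    vram_low: bool
    progress_stalled: bool
    output_valid: bool


def decide(s: Snapshot) -> List[str]:
    """Map one snapshot to an ordered list of action names."""
    actions: List[str] = []
    if s.task_stale and not s.process_alive:
        actions += ["fail_task", "requeue_task", "restart_worker"]
    elif s.task_stale and s.process_alive:
        # Alive but stuck: the process is hung.
        actions += ["kill_worker", "fail_task", "restart_worker"]
    if s.vram_low and s.progress_stalled:
        actions += ["flush_vram", "wait_for_cleanup", "resume"]
    if not s.output_valid and s.process_alive:
        actions += ["kill_worker", "restart_worker_fresh_model"]
    # Several rules can fire at once; drop repeated actions, keep the order.
    return list(dict.fromkeys(actions))
```

A healthy snapshot returns an empty list, which is the conservative default: when no rule matches, the heartbeat does nothing.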
The three-consecutive rule matters. A single failed check is often transient — a GPU memory spike during model loading, a file system hiccup, a momentary stall. Three in a row is a pattern.
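Tracking consecutive failures takes a counter per check that resets on any pass. In this sketch the class name is hypothetical, and the choice to alert exactly once when a streak reaches the threshold (rather than on every heartbeat after it) is my assumption, made to avoid repeated alerts for the same outage.

```python
from collections import defaultdict
from typing import Dict, List


class FailureStreaks:
    """Counts consecutive failures per check across heartbeats."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._streaks: Dict[str, int] = defaultdict(int)

    def record(self, results: Dict[str, bool]) -> List[str]:
        """Update streaks from {check_name: passed}; return checks to alert on."""
        alerts = []
        for name, passed in results.items():
            if passed:
                self._streaks[name] = 0  # any pass breaks the streak
            else:
                self._streaks[name] += 1
                if self._streaks[name] == self.threshold:
                    alerts.append(name)  # fires once per streak
        return alerts
```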
The decision rules should be conservative. It is better to wait one extra heartbeat cycle than to kill a working process. Restarting workers has a cost — model loading, VRAM allocation, context restoration. You do not want to do it unnecessarily.
Action: What to Do
The action layer is where the agent earns its keep. When the heartbeat decides something is wrong, it does one of three things:
Self-heal. Restart the worker, flush VRAM, re-queue the task. No human involved. The heartbeat logs the action and moves on. This handles about 80% of failures.
Alert. Send a message to Discord with the specific failure, the action taken, and what, if anything, the human should check. This handles the roughly 15% of cases where self-healing worked but something underlying might need attention.
Page. Send a high-priority notification. This is for the remaining 5% or so of cases where the system cannot self-heal: model file corruption, disk full, GPU disconnected. The human needs to intervene.
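Choosing between the three levels can be sketched as a small pure function. The failure names, the `needs_followup` flag, and the function itself are hypothetical. Delivery (the Discord message, the page) is left out on purpose so the routing logic stays testable.

```python
from enum import Enum


class Level(Enum):
    SELF_HEAL = "self_heal"  # fixed automatically, logged only
    ALERT = "alert"          # fixed, but a human should take a look later
    PAGE = "page"            # not fixable automatically


# Hypothetical labels for the failures the text calls unrecoverable.
UNRECOVERABLE = {"model_corrupt", "disk_full", "gpu_disconnected"}


def choose_level(failure: str, healed: bool, needs_followup: bool = False) -> Level:
    """Pick the escalation level for one handled failure."""
    if failure in UNRECOVERABLE or not healed:
        return Level.PAGE
    if needs_followup:
        return Level.ALERT
    return Level.SELF_HEAL
```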
The goal is not zero human intervention. The goal is that the human only intervenes for problems that actually require human judgment. Everything else — hung processes, VRAM exhaustion, stale workers — is handled automatically.
Why the Heartbeat Lives Inside the Agent
You could build this as a separate monitoring service. A Prometheus instance. A Datadog dashboard. A shell script in a crontab. Those all work.
But there is an advantage to putting the heartbeat inside the agent itself.
The agent has context. It knows what task was running when the worker hung. It knows which model was loaded. It knows the task history, the retry count, the VRAM state. When it restarts a worker, it can do so with the right parameters — not just "restart process 12345" but "re-queue chapter 47 with the Codex voice profile at segment 12."
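To make that concrete, here is a sketch of what a context-carrying re-queue might look like. The `TaskContext` fields mirror the example above (chapter, voice profile, segment, retry count). The retry cap of 3 and the function name are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaskContext:
    # Hypothetical record of what the agent knows about a running task.
    chapter: int
    voice_profile: str
    segment: int
    retries: int = 0


def build_requeue(ctx: TaskContext, max_retries: int = 3) -> Optional[TaskContext]:
    """Return the task to re-queue, resuming at the same segment.

    Returns None once the retry budget is spent, which is the signal
    to escalate to a human instead of retrying again.
    """
    if ctx.retries >= max_retries:
        return None
    return replace(ctx, retries=ctx.retries + 1)
```

An external monitor has none of these fields to work with. It can restart a PID, but it cannot know which segment to resume from.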
A monitoring system that understands the work it is monitoring makes better decisions than one that only sees process trees and memory usage.
External monitors answer "is the process alive?" The agent's heartbeat answers "is the work progressing?" Those are different questions. A process can be alive and producing garbage. A process can be dead even though the work finished and only the exit was messy. The agent's heartbeat knows the difference.
The Cost
The heartbeat runs every five minutes. Each poll takes about two seconds — reading the process table, checking file sizes, querying VRAM. The compute cost is negligible.
The real cost is design time. You have to think about failure modes upfront. What does "stuck" look like for your system? What does "garbage output" look like? What is the right stale-detection threshold — 2 hours? 4? 8?
These are not questions you answer once. You refine them every time the system fails in a new way that the heartbeat missed. Each miss becomes a new check. My current heartbeat catches failures I used to lose hours to. The first version caught almost nothing: it only checked whether the process was alive. Every few weeks a new failure mode would surface, I would add a check for it, and the heartbeat would get a little smarter.
The General Principle
If your AI system runs unattended, it needs to monitor itself. Not through external dashboards that a human has to remember to check. Through an internal loop that detects problems and takes action.
The heartbeat pattern is not complex. It is five checks, a few decision rules, and three action levels. What makes it work is that it runs constantly, it understands the work, and it has the authority to act.
Start simple. One check — is the process alive? Add checks every time you discover a new failure mode. Let the system earn its own reliability, one heartbeat at a time.