Building a Recovery System for AI Pipelines That Refuse to Die Quietly

The worst failure mode in any automated system is not a crash.

A crash is loud. It leaves a stack trace. It tells you what happened.

The worst failure mode is silence.

The process disappears. No error. No output. No indication that anything was ever running. You wake up the next morning expecting a finished audiobook chapter and instead you find an empty directory and a ghost.

That happened to me three times before I decided to fix it properly.

The Problem With Long-Running AI Jobs

My narrator pipeline generates audiobook chapters using a local TTS model. A single chapter can take hours. The model uses most of my GPU's 12 GB of VRAM. Workers occasionally crash from memory pressure. The pipeline has resume logic, but that only helps if someone is around to trigger it.

Overnight runs were the worst. I would start a chapter before bed, check on it in the morning, and find nothing. No audio. No log file. No error message. Just a process that had quietly stopped existing at some point during the night.

The pipeline could resume itself. It had the capability. It just needed someone to restart it.

That is not good enough.

The Recovery Pattern

I built a PowerShell script that runs every five minutes as part of Karl's heartbeat check. It does four things:

Check the catalog. Every TTS job writes a catalog entry before it starts. The entry includes the book title, output directory, a full restart command, retry count, and maximum retries. If there is no catalog entry, there is nothing to recover.

Check if the process is alive. If the catalog shows a job in progress but no narrator-tts process is running, the job died. That is the silent failure.

Restart the job. The script takes the stored restart command, adds a resume flag, and relaunches the pipeline. The retry count gets incremented.

Give up gracefully. After three failed retries, or two retries with zero progress, the job gets marked as failed. The system stops trying and flags it for human attention. I use simple retries with no backoff because failures are almost always process death or OOM, not transient network errors. Exponential backoff with jitter would be appropriate for network-dependent jobs, but for local GPU workloads, waiting longer between attempts does not change the outcome.
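The restart and give-up rules above can be sketched as a small decision function. This is an illustration in Python, not the actual PowerShell script: the field names, the `--resume` flag, and the way "zero progress" is tracked are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_STALLED_RETRIES = 2  # "two retries with zero progress"


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; field names are illustrative only."""
    title: str
    output_dir: str
    restart_command: str
    retry_count: int = 0
    max_retries: int = 3
    stalled_retries: int = 0  # consecutive restarts that produced nothing new
    status: str = "in_progress"


def recover(entry: CatalogEntry, process_alive: bool,
            made_progress: bool) -> Optional[str]:
    """Return the command to relaunch, or None if nothing should be started."""
    # Nothing to do if the job is not in progress or is still running.
    if entry.status != "in_progress" or process_alive:
        return None
    # Only a previous restart can count as stalled.
    if entry.retry_count > 0:
        entry.stalled_retries = 0 if made_progress else entry.stalled_retries + 1
    # Give up: retry budget spent, or repeated restarts made no progress.
    if (entry.retry_count >= entry.max_retries
            or entry.stalled_retries >= MAX_STALLED_RETRIES):
        entry.status = "failed"
        return None
    entry.retry_count += 1
    return f"{entry.restart_command} --resume"  # assumed resume flag
```

The caller would launch the returned command and save the updated entry back to the catalog, so the retry count survives the next crash.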

Why Pre-Flight Writes Matter

The most important lesson became a non-negotiable rule: every job must write its catalog entry before it starts.

Not after the first chapter. Not when the first segment completes. Before the pipeline begins.

If the catalog entry does not exist before the job starts, the recovery system has no way to know a job was supposed to be running. This is exactly how my overnight runs disappeared. No catalog entry meant no recovery.

This is essentially a write-ahead intent log — the same pattern databases use for crash recovery. It is a general principle that applies far beyond TTS. Any long-running AI process should write its intent to a durable store before it begins execution. Not for convenience. For survival.

If the system crashes before writing its intent, nobody knows it was supposed to be there.
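A minimal Python sketch of a pre-flight write, assuming the catalog is a JSON file on local disk. The function names and the entry layout are hypothetical. The write goes to a temporary file that is flushed, synced, and then renamed into place, so a crash during the write should not leave a half-written catalog behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_intent(catalog_path: Path, entry: dict) -> None:
    """Durably record that a job is about to start."""
    catalog_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=catalog_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(entry, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, catalog_path)  # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def run_job(catalog_path: Path, entry: dict, launch: Callable[[], int]) -> int:
    """Write the intent first, then start the work."""
    write_intent(catalog_path, {**entry, "status": "in_progress"})
    return launch()
```

The ordering inside `run_job` is the whole point: if `launch` dies one second in, the catalog already says a job was supposed to be running.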

The Heartbeat Integration

The recovery script runs inside Karl's heartbeat loop. Every five minutes, Karl checks:

  • Are there any stale jobs in the catalog?
  • Is the process that should be running actually running?
  • Should I restart it?

This is cheap. The script runs in under a second. It checks a JSON file and queries running processes. No GPU usage, no network calls, no meaningful overhead.

But it catches failures that would otherwise be invisible.
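One heartbeat check might look roughly like the following Python sketch. It assumes the catalog is a JSON list of job records with a `status` field, which is a guess at the layout, and it takes the list of running process names as an argument instead of querying the OS itself.

```python
import json
from pathlib import Path
from typing import Iterable, List

PROCESS_NAME = "narrator-tts"  # process name mentioned in the article


def find_dead_jobs(catalog_path: Path, running: Iterable[str]) -> List[dict]:
    """Return in-progress jobs whose worker process is no longer running."""
    if not catalog_path.exists():
        return []  # no catalog, nothing to recover
    jobs = json.loads(catalog_path.read_text(encoding="utf-8"))
    alive = any(PROCESS_NAME in name.lower() for name in running)
    if alive:
        return []
    return [job for job in jobs if job.get("status") == "in_progress"]
```

In a real script the `running` names would come from the operating system's process list. Passing them in keeps the check itself trivial to test.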

The Three Categories of Failure

After running this system for a while, I started noticing patterns in how AI pipelines fail.

Loud failures. The process crashes with a stack trace. The OS logs an error. This is the easiest category because you know immediately that something went wrong.

Quiet failures. The process exits cleanly but produces no useful output. Maybe the model loaded but generated silence. Maybe the configuration was wrong. The pipeline thinks it succeeded, so the job is never marked as failed, and a liveness check alone has nothing to react to.

Silent failures. The process disappears. No trace. No exit code. This is the category that killed my overnight runs. The only way to catch these is external monitoring — something outside the pipeline checking whether it is still alive.

The recovery system handles all three. Loud failures trigger a restart. Quiet failures get caught when the output does not match expectations, though that check is still the weakest of the three. Silent failures get caught by the process liveness check.

What I Would Do Differently

The system works, but there are improvements I can already see.

The catalog entry should include expected output size or segment count, so the recovery system can detect quiet failures where the process ran but produced almost nothing.
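As a sketch of what that check could look like, the Python function below compares the number of usable segment files on disk with an expected count taken from the catalog. The `.wav` pattern, the 1 KiB size threshold, and the `expected_segments` parameter are assumptions for illustration only.

```python
from pathlib import Path


def looks_like_quiet_failure(output_dir: Path, expected_segments: int,
                             min_bytes: int = 1024,
                             pattern: str = "*.wav") -> bool:
    """True if fewer usable segments exist than the catalog expected."""
    usable = [
        path for path in output_dir.glob(pattern)
        if path.stat().st_size >= min_bytes  # skip empty or truncated files
    ]
    return len(usable) < expected_segments
```

A size threshold is a crude proxy. It would catch empty files but not a full-length file of generated silence.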

The retry logic should be smarter about failure causes. An out-of-memory crash is transient and worth retrying. A configuration error will fail the same way every time. Right now the system retries both identically.
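One way to approach this, sketched in Python, is to classify the tail of whatever error output survives. The patterns below are illustrative guesses, not messages taken from the actual pipeline, and an empty or unrecognized tail is treated as worth retrying because a silent death leaves nothing to inspect.

```python
import re

# Hypothetical markers; a real list would come from observed failures.
TRANSIENT = [re.compile(p, re.I) for p in (r"out of memory", r"\boom\b")]
PERMANENT = [re.compile(p, re.I) for p in (
    r"unrecognized arguments?",
    r"no such file or directory",
    r"configuration error",
)]


def classify(stderr_tail: str) -> str:
    """Label a failure as 'transient', 'permanent', or 'unknown'."""
    if any(p.search(stderr_tail) for p in TRANSIENT):
        return "transient"
    if any(p.search(stderr_tail) for p in PERMANENT):
        return "permanent"
    return "unknown"


def should_retry(stderr_tail: str) -> bool:
    """Retry everything except failures that will repeat identically."""
    return classify(stderr_tail) != "permanent"
```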

And the recovery script should run health checks before restarting. Is there enough VRAM? Is Ollama running? Is there disk space? Restarting a job that will immediately crash again just wastes the retry budget.
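A minimal Python sketch of such pre-restart checks follows. The disk threshold is a made-up number, and probing port 11434 assumes Ollama is listening on its default local port. A VRAM check would slot in as one more callable, but it depends on GPU tooling and is left out here.

```python
import shutil
import socket
from typing import Callable, Dict, List


def enough_disk(path: str, min_free_bytes: int) -> bool:
    """True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


def default_checks(output_dir: str) -> Dict[str, Callable[[], bool]]:
    """Hypothetical check set; thresholds and port are assumptions."""
    return {
        "disk": lambda: enough_disk(output_dir, 5 * 1024**3),
        "ollama": lambda: port_open("127.0.0.1", 11434),
    }
```

If `failed_checks` returns anything, the script could skip the restart without spending a retry and try again on the next heartbeat.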

But the core pattern is sound. Write your intent. Check your liveness. Restart on failure. Give up gracefully.

It is not glamorous engineering. But it is the difference between a system that works when you are watching and a system that works while you sleep.