The Self-Checking TTS Loop: Generate, Transcribe, Compare, Retry

The TTS pipeline was working. Text went in, audio came out. The narrator-tts worker processes split a manuscript into segments, generate audio for each one, and stitch the results together. A full novel — 120,000 words, roughly 14 hours of audio — takes about 8 hours on my RTX 4070 Ti.

The problem was quality verification. How do you know the audio is correct?

You can listen to it. I listened to the first few chapters of the first book I generated. It sounded great. The voice was consistent, the pacing was natural, the pronunciation was clean. Then I spot-checked chapters 8, 15, and 22. All fine. I declared it good and moved on.

Chapter 14 had a problem. One of the segments — about 45 seconds of audio — had a subtle glitch. The model produced a half-second of static in the middle of a sentence, then continued normally. The words before and after the static were correct. If you were listening casually, you might not notice. If you were listening on speakers at moderate volume, you would absolutely notice.

I did not catch it. I had not listened to chapter 14.

Spot-checking 14 hours of audio is not quality assurance. It is prayer.

The Problem With Generative Audio QA

Traditional software QA checks outputs against expected values. You assert that function(input) returns expected_output. If it does not, the test fails.

Generative audio does not have expected outputs. The same text can be read in infinitely many valid ways — different pacing, different emphasis, different breath patterns. You cannot assert that the audio matches a reference waveform. You cannot even assert that the waveform is "smooth" or "clean" in any simple way, because natural speech has irregular waveforms by design.

What you can assert is that the audio contains the right words in the right order. If the model dropped a word, repeated a word, substituted a word, or produced noise instead of a word, the text content of the audio will not match the source manuscript.

This is the insight: you do not verify how the audio sounds. You verify that it says what it was supposed to say.

The Loop

The QA loop has four stages:

1. Generate. The TTS worker produces audio for a text segment. This is the normal generation step — no changes to the pipeline.

2. Transcribe. Run the generated audio through Whisper. The ASR model produces a text transcription of what it heard in the audio.

3. Compare. Check the Whisper transcription against the original source text. Normalize both first (lowercase, strip punctuation, expand contractions, collapse whitespace) so the comparison is on word content, not formatting.

4. Retry or accept. If the transcribed text matches the source text (or is close enough within a threshold), the segment passes. If the mismatch exceeds the threshold, the segment goes back into the generation queue for another attempt.

The loop runs automatically. No human listens to anything. No human reads transcripts. The system generates, verifies, and retries on its own.
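A minimal sketch of one pass through the first three stages, assuming the TTS and ASR steps can be wrapped as plain functions. The names check_segment, CheckResult, fake_tts, fake_asr, and same_words are hypothetical and are not taken from the narrator-tts code; the fake stages exist only so the example runs without a GPU or a model download.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    audio: bytes       # whatever the TTS stage produced
    transcript: str    # what the ASR stage heard
    passed: bool       # verdict from the compare stage


def check_segment(
    text: str,
    synthesize: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
    matches: Callable[[str, str], bool],
) -> CheckResult:
    """Run one generate -> transcribe -> compare pass for a single segment."""
    audio = synthesize(text)              # 1. Generate
    transcript = transcribe(audio)        # 2. Transcribe
    passed = matches(text, transcript)    # 3. Compare
    # 4. Retry or accept is left to the caller, based on `passed`.
    return CheckResult(audio, transcript, passed)


# Stand-in stages (assumptions for illustration, not real models).
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8").upper()


def same_words(source: str, heard: str) -> bool:
    return source.lower().split() == heard.lower().split()


result = check_segment("The rain had stopped.", fake_tts, fake_asr, same_words)
print(result.passed)
```

Passing the stages in as functions keeps the check independent of any particular TTS or ASR model, so either one can be swapped without touching the loop.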

The Comparison

The comparison step is where the design decisions live. You need to handle four things:

Exact word match is too strict. Whisper sometimes transcribes "cannot" as "can not," or "well-being" as "well being." These are not errors in the audio — they are normalization differences in the transcription. A strict word-by-word comparison would flag these as mismatches and trigger unnecessary retries.

Fuzzy matching handles this. Instead of exact string equality, compare using word error rate (WER). Normalize both texts (lowercase, remove punctuation, expand contractions), then compute the edit distance between the two word sequences, counting substituted, dropped, and inserted words, and divide it by the number of words in the source. A WER of 0% means a perfect match. A WER of 5% means roughly one word in twenty is different.
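Here is one way the normalization and WER calculation could look in plain Python. It is a sketch, not the pipeline's actual code. WER is taken in its usual form: substitutions plus deletions plus insertions, divided by the number of source words. The EXPANSIONS table is a tiny illustrative stand-in; a real one would be much longer.

```python
import re

# Illustrative only: a handful of forms that ASR output tends to spell differently.
EXPANSIONS = {
    "cannot": "can not",
    "can't": "can not",
    "won't": "will not",
    "don't": "do not",
}


def normalize(text: str) -> list[str]:
    """Lowercase, split hyphenated words, drop punctuation, expand contractions."""
    text = text.lower().replace("\u2019", "'")      # curly apostrophe -> straight
    text = re.sub(r"[-\u2013\u2014]", " ", text)    # hyphens and dashes separate words
    text = re.sub(r"[^\w\s']", "", text)            # drop other punctuation
    words: list[str] = []
    for token in text.split():                      # split() also collapses whitespace
        token = token.strip("'")                    # remove quote marks around a word
        expanded = EXPANSIONS.get(token, token)
        words.extend(expanded.replace("'", "").split())
    return words


def word_error_rate(source: str, transcript: str) -> float:
    """Word-level edit distance divided by the number of source words."""
    ref, hyp = normalize(source), normalize(transcript)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))                # distance from an empty source
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,          # deletion: source word missing from transcript
                curr[j - 1] + 1,      # insertion: extra word in transcript
                prev[j - 1] + cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("She cannot stay.", "she can not stay"))
print(word_error_rate("the cat sat on the mat", "cat sat on the mat"))
```

With this normalization, "cannot" and "can not" compare equal, and so do "well-being" and "well being", so neither one counts against the segment.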

Set a threshold. Below the threshold, accept the segment. Above it, retry. I set the threshold at 2%. That translates to roughly one word difference in a 50-word segment. Any more than that, and something is likely wrong — a dropped word, a mispronunciation that Whisper heard as a different word, or audio corruption.

The threshold is the most important parameter in the system. Too low and you retry segments that are actually fine, wasting GPU time. Too high and you accept segments with real errors. 2% was the sweet spot after testing on about 40 hours of generated audio.
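To see what a 2% threshold allows at different segment lengths, the arithmetic can be sketched as below. The function names are hypothetical. The sketch also assumes that a segment sitting exactly at the threshold passes, which the description above leaves open.

```python
def passes(word_errors: int, source_words: int, threshold: float = 0.02) -> bool:
    """Accept a segment when its WER is at or below the threshold."""
    if source_words == 0:
        return word_errors == 0
    return word_errors / source_words <= threshold


def error_budget(source_words: int, threshold: float = 0.02) -> int:
    """Largest number of word errors a segment of this length can have and still pass."""
    budget = 0
    while passes(budget + 1, source_words, threshold):
        budget += 1
    return budget


for n in (25, 50, 100, 150):
    print(n, error_budget(n))
```

One consequence worth knowing: at 2%, a segment shorter than 50 words has no error budget at all, so a single differing word sends it back for a retry. Because the threshold is a ratio, its strictness depends on segment length.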

Cap the retries. A segment that fails three times in a row is flagged for manual review rather than retried indefinitely. If the TTS model consistently produces audio that Whisper cannot transcribe correctly, either the text has an issue the model cannot handle — unusual formatting, a character name it cannot pronounce — or the segment is genuinely broken. Either way, infinite retries will not fix it.
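The retry cap can be sketched as a small wrapper around a single generate-and-check attempt. The names generate_with_retries and SegmentOutcome, and the status strings, are hypothetical. The attempt callable stands in for a full generate, transcribe, compare pass, and make_flaky only simulates failures for the demo.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # per the text: three failures in a row means manual review


@dataclass
class SegmentOutcome:
    status: str               # "accepted" or "needs_review"
    attempts: int             # how many generations were run
    audio: Optional[bytes]    # None when no attempt passed


def generate_with_retries(
    text: str,
    attempt: Callable[[str], tuple[bytes, bool]],
    max_attempts: int = MAX_ATTEMPTS,
) -> SegmentOutcome:
    """Regenerate until one attempt passes or the cap is reached."""
    for n in range(1, max_attempts + 1):
        audio, passed = attempt(text)
        if passed:
            return SegmentOutcome("accepted", n, audio)
    # Every attempt failed: stop retrying and hand the segment to a human.
    return SegmentOutcome("needs_review", max_attempts, None)


def make_flaky(failures_before_success: int):
    """Build a fake attempt that fails a fixed number of times, then passes."""
    calls = {"count": 0}

    def attempt(text: str) -> tuple[bytes, bool]:
        calls["count"] += 1
        return text.encode("utf-8"), calls["count"] > failures_before_success

    return attempt


print(generate_with_retries("Etrath fell at dawn.", make_flaky(1)).status)
print(generate_with_retries("Etrath fell at dawn.", make_flaky(5)).status)
```

The first call is accepted on its second attempt. The second never passes and comes back flagged for review after three attempts.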

What It Catches

After running the QA loop across four full books, here is what it found:

Dropped words. The TTS model occasionally skipped a short word — "the," "a," "but" — usually at segment boundaries where the text split fell between sentences. The audio sounded natural because the grammar still worked without the word, but the content was technically wrong. Whisper caught these.

Garbled pronunciation. Unusual names and invented terms sometimes came out wrong. The model might pronounce "Etrath" (a place name) as "EE-rath" in one segment and "EH-trath" in another. Whisper transcribed these differently, and the comparison flagged the segments where the transcription no longer matched the source. Not a showstopper, but worth knowing about if you want a name pronounced the same way throughout a book.

Audio glitches. The half-second of static from chapter 14. Whisper transcribed the static as a burst of noise tokens or dropped the words that were spoken during the glitch. Either way, the transcription did not match the source text, and the segment was retried.

Silent failures. One segment generated zero audio — the model produced an empty waveform. This is the worst kind of failure because it is invisible. No error message, no crash, just silence. The QA loop caught it because Whisper had nothing to transcribe, which meant the comparison failed immediately.

Silent failures are the ones that justify the entire system. A crash gets noticed. A glitch gets noticed eventually. A silent failure — missing audio, empty output — can ship to readers undetected. The ASR comparison catches it because the absence of words is itself a mismatch.
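When a segment does get flagged, it helps to know what kind of mismatch triggered it. The sketch below is an illustration rather than something the pipeline described here is known to do. It uses difflib from the Python standard library to sort the differences between source and transcript into dropped, inserted, and substituted words. describe_mismatch is a hypothetical name, and the inputs are assumed to be already normalized word lists.

```python
from difflib import SequenceMatcher


def describe_mismatch(source_words: list[str], heard_words: list[str]) -> dict[str, list]:
    """Group the differences between source and transcript by kind."""
    report: dict[str, list] = {"dropped": [], "inserted": [], "substituted": []}
    matcher = SequenceMatcher(a=source_words, b=heard_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":        # words in the source that the ASR never heard
            report["dropped"].extend(source_words[i1:i2])
        elif tag == "insert":      # words the ASR heard that are not in the source
            report["inserted"].extend(heard_words[j1:j2])
        elif tag == "replace":     # a run of source words heard as something else
            report["substituted"].append((source_words[i1:i2], heard_words[j1:j2]))
    return report


source = "but they rode north to etrath".split()

# A dropped short word plus a mangled name.
print(describe_mismatch(source, "they rode north to erath".split()))

# A silent segment: nothing was heard, so every source word counts as dropped.
print(describe_mismatch(source, []))
```

difflib produces a readable alignment, not a guaranteed minimal edit distance, so this suits a review report better than computing the WER itself.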

The Cost

Running Whisper on every generated segment adds processing time. Whisper is fast — much faster than the TTS model — but it is not free. For a 14-hour audiobook, the ASR pass adds about 30 minutes to the total pipeline time.

That is about a 6% overhead on an 8-hour job (30 minutes out of 480). For catching errors that would otherwise make it into a published audiobook, that is negligible.

The retry cost is also modest. In practice, about 4% of segments fail the first check and get regenerated. Most pass on the second attempt. The total retry overhead is under 10% additional generation time. In other words, the GPU cost of better quality assurance is regenerating fewer than one segment in ten.
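The overhead figures can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch built on two simplifying assumptions that are not measurements from the pipeline: every retry fails with the same 4% probability as the first attempt, independently of earlier attempts, and all segments cost about the same to generate.

```python
TTS_HOURS = 8.0          # generation time for the 14-hour audiobook
ASR_MINUTES = 30.0       # added Whisper pass
FIRST_CHECK_FAIL = 0.04  # share of segments that fail the first check
MAX_ATTEMPTS = 3         # retry cap before manual review

asr_overhead = (ASR_MINUTES / 60) / TTS_HOURS

# Assumption: a second attempt is needed with probability p, a third with p**2.
# The sum is the expected number of extra generations per segment.
extra_generations = sum(FIRST_CHECK_FAIL ** k for k in range(1, MAX_ATTEMPTS))

print(f"ASR overhead: {asr_overhead:.2%}")
print(f"expected extra generations per segment: {extra_generations:.4f}")
```

Under those assumptions the retry overhead comes out near 4%, well inside the 10% bound. Retried segments also pass through Whisper again, which adds a little to the ASR overhead.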

The General Principle

The pattern extends beyond TTS. Any generative pipeline that produces output you cannot exhaustively verify by hand needs a closed verification loop:

  1. Generate the output.
  2. Run it through an independent model that can check it.
  3. Compare the check result to the source of truth.
  4. Retry on failure, cap retries, flag for review.

The key word is independent. The verification model must be different from the generation model. If you use the same model to check its own output, it tends to confirm its own mistakes. Whisper and the TTS model are architecturally different systems trained on different objectives. When they disagree, the disagreement is signal.

A generative system without a verification loop is a system that ships untested output. The question is not whether errors exist — they do. The question is whether you find them before your readers do.

Build the loop. Let it run.