Achieving 0.0% Word Loss Across a Full Book TTS Pipeline

The self-checking TTS loop I wrote about last week catches garbled segments. Generate audio, transcribe it with Whisper, compare to the source text, retry if the mismatch is too high. That system works. It catches dropped words, audio glitches, silent failures — all at the segment level.

But segment-level QA has a blind spot. It verifies that each segment is good in isolation. It does not verify that all the segments are there.

This is the gap: every generated segment can pass its individual QA check, but what if a segment was never generated at all? What if the text splitter clipped a sentence at a boundary? What if the manuscript had 800 segments but the pipeline only processed 797?

Per-segment quality is not the same as whole-book fidelity. One is about each piece being correct. The other is about the complete work being intact.

The Problem of Missing Words

I found the problem the hard way. After generating the audiobook for the first book in a four-book series, I ran the self-checking loop on every segment. All 797 segments passed. The audio sounded clean. I was ready to publish.

Then I ran a rough word count. The source manuscript had 118,400 words. The ASR transcription of all the generated audio — concatenated end to end — had 117,850 words. That is a difference of 550 words. Roughly three pages of text simply vanished somewhere in the pipeline.
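A rough count like this needs very little code. The sketch below assumes the source text and the per-segment transcripts are already loaded as strings; the function names are hypothetical, not part of the actual pipeline.

```python
def rough_word_count(text: str) -> int:
    """Count whitespace-separated tokens. Crude, but enough to spot a gap."""
    return len(text.split())


def word_gap(source_text: str, segment_transcripts: list[str]) -> int:
    """Source word count minus the combined word count of all transcripts."""
    generated = sum(rough_word_count(t) for t in segment_transcripts)
    return rough_word_count(source_text) - generated


# The figures reported for book one:
print(118_400 - 117_850)  # 550
```

A count gap only says that something is missing, not what or where. It can also hide losses if duplicated words inflate the transcript total.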

Per-segment QA had passed because each segment that was generated was correct. The missing words were in segments that were never processed. The text splitter had failed on three chapter breaks with unusual formatting — colons in chapter titles that the regex did not expect — and silently dropped those segments from the work queue.

A segment that is never generated cannot fail a quality check. It does not exist. The QA loop has nothing to evaluate. The absence is invisible until you look at the whole.

End-to-End Reconciliation

The fix is a second layer of verification that operates at the book level, not the segment level. I call it fidelity reconciliation.

The system works in three steps:

1. Collect the source words. Take the original manuscript, strip formatting markers, and produce a normalized word list. Lowercase everything. Expand contractions. Remove punctuation. The goal is a clean sequence of words that represents the literal content of the book.

2. Collect the generated words. Concatenate every segment's ASR transcription — the same Whisper output that the per-segment loop already produced — into one continuous word list. Apply the same normalization.

3. Reconcile. Walk both lists and find every word in the source that has no match in the generated output. Report the missing words, their positions in the source text, and the overall fidelity percentage.
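The three steps above can be sketched with the standard library. This is a minimal illustration, not the pipeline's actual code: the contraction table is a tiny placeholder, the function names are hypothetical, and `difflib.SequenceMatcher` is one possible way to align the two word lists.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative table; a real pipeline would need a much fuller one.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}


def normalize(text: str) -> list[str]:
    """Lowercase, expand known contractions, and drop punctuation."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe to straight
    words = []
    # Runs of letters/digits, allowing apostrophes inside a word.
    for token in re.findall(r"[^\W_]+(?:'[^\W_]+)*", text):
        expanded = CONTRACTIONS.get(token, token)
        words.extend(expanded.replace("'", "").split())
    return words


def reconcile(source_words: list[str], generated_words: list[str]) -> list[tuple[int, str]]:
    """Return (source_position, word) for each source word with no aligned match."""
    matcher = SequenceMatcher(a=source_words, b=generated_words, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": source words absent from the output.
        # "replace": source words that came out as something else.
        if tag in ("delete", "replace"):
            missing.extend((i, source_words[i]) for i in range(i1, i2))
    return missing


source = normalize("Don't stop. The dragon slept.")
generated = normalize("do not stop the slept")
print(reconcile(source, generated))  # [(4, 'dragon')]
```

Counting "replace" spans as missing treats a mis-transcribed word the same as an absent one, which matches the strict reading of step 3. On a full-length book this alignment may be slow; running it chapter by chapter is one way to keep it manageable.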

The fidelity percentage is simple:

(source_words - missing_words) / source_words * 100

A book with 0 missing words scores 100.0%. Anything less means content was lost.
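As a worked example, here is the formula as a small function. Plugging in book one's figures assumes that the 550-word count gap equals the number of missing words, which is a simplification.

```python
def fidelity_percent(source_words: int, missing_words: int) -> float:
    """Share of source words that have a match in the generated output."""
    if source_words <= 0:
        raise ValueError("source_words must be positive")
    return (source_words - missing_words) / source_words * 100


print(round(fidelity_percent(118_400, 550), 2))  # 99.54
print(fidelity_percent(118_400, 0))              # 100.0
```

A result of 99.54% looks close to perfect, which is why the report also lists the missing words and their positions rather than the percentage alone.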

What the Reconciliation Catches

After adding this system and running it across all four books, the missing-word report identified five categories of loss:

Missing segments. The original problem. Segments that the splitter dropped or that the work queue never picked up: the segments at three chapter breaks in book one and at one chapter break in book three. These were the big losses, entire paragraphs of text that never became audio.

Boundary clipping. The text splitter sometimes broke a sentence at an awkward point, and the TTS model treated the leading or trailing word as noise. The per-segment QA loop tolerated this because the mismatch was small — one word in a 60-word segment is a 1.7% error rate, under the 2% threshold. But across 800 segments, those single-word losses added up to dozens of missing words.

Unpronounceable terms. Invented vocabulary that the TTS model produced as a brief pause or a garbled phoneme. Whisper could not transcribe it, so it appeared as a missing word in the reconciliation. Not an audio glitch — the model genuinely could not say the word.

Script artifacts. Stage directions, scene break markers (***), and chapter headers that were in the source text but should not have been narrated. These showed up as "missing" in the reconciliation but were actually intentional omissions. The system needed an exclusion list.

Duplicate generation. A subtle one. When a segment was retried by the QA loop, the old (failed) audio file was sometimes not cleaned up. The ASR transcription ran on both files, producing duplicate words that masked missing words elsewhere. The reconciliation needed to deduplicate.

Each of these failure modes was invisible to the per-segment QA loop. The loop checks quality. Reconciliation checks completeness. You need both.
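The boundary-clipping case comes down to simple arithmetic. The check below is an assumed form of the per-segment test (mismatched words divided by total words, compared against a threshold); the real loop may measure mismatch differently.

```python
def segment_passes(total_words: int, mismatched_words: int, threshold: float) -> bool:
    """Assumed per-segment check: pass when the mismatch rate is under the threshold."""
    return mismatched_words / total_words < threshold


# One lost word in a 60-word segment is about 1.7%.
print(segment_passes(60, 1, 0.02))  # True: slips under a 2% threshold
print(segment_passes(60, 1, 0.01))  # False: caught at a 1% threshold
```

Each passing segment looks fine on its own, so the loss only becomes visible when the words are counted across the whole book.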

Getting to Zero

The path from 550 missing words to zero was iterative:

Fix the splitter. The regex-based chapter break detector was replaced with a parser that handles colons, numbers, and mixed-case titles. No more dropped segments at boundaries.

Tighten the QA threshold. The per-segment error threshold went from 2% to 1%. This triggered more retries but eliminated the accumulated single-word losses at segment boundaries.

Add an exclusion list. Scene break markers, chapter headers, and other non-narrated text are stripped from the source word list before reconciliation. They are intentional omissions, not losses.

Clean up retries. When a segment is regenerated, the old audio file and its transcription are deleted. The reconciliation only sees the final output.
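Two of the fixes above are easy to sketch. The patterns and the `(segment_id, attempt_number, transcript)` record shape are assumptions for illustration. The second function also swaps in a different technique: instead of deleting old files, it keeps only the latest attempt per segment at reconciliation time.

```python
import re

# Assumed patterns for lines that are never narrated. The chapter pattern would
# also drop a prose line that happens to start with "Chapter", so a real list
# needs tuning to the manuscript.
NON_NARRATED = [
    re.compile(r"^\s*\*\s*\*\s*\*\s*$"),                 # scene break: ***
    re.compile(r"^\s*chapter\s+\w+.*$", re.IGNORECASE),  # chapter header
]


def strip_non_narrated(manuscript: str) -> str:
    """Drop excluded lines before building the source word list."""
    kept = [
        line
        for line in manuscript.splitlines()
        if not any(pattern.match(line) for pattern in NON_NARRATED)
    ]
    return "\n".join(kept)


def final_transcripts(attempts: list[tuple[int, int, str]]) -> list[str]:
    """Keep the highest-numbered attempt per segment, in segment order."""
    latest: dict[int, tuple[int, str]] = {}
    for segment_id, attempt_number, transcript in attempts:
        if segment_id not in latest or attempt_number > latest[segment_id][0]:
            latest[segment_id] = (attempt_number, transcript)
    return [latest[segment_id][1] for segment_id in sorted(latest)]


text = "Chapter 1: Dawn\nShe woke.\n***\nHe slept."
print(strip_non_narrated(text).splitlines())  # ['She woke.', 'He slept.']
print(final_transcripts([(2, 1, "b"), (1, 1, "garbled"), (1, 2, "a")]))  # ['a', 'b']
```

Filtering at read time is a guard against stale files, not a replacement for deleting them; a leftover audio file could still end up in the published audio.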

After these fixes, the fidelity report for all four books read:

Book 1: 118,400 source words, 118,400 matched. Fidelity: 100.0%
Book 2: 124,100 source words, 124,100 matched. Fidelity: 100.0%
Book 3: 131,800 source words, 131,800 matched. Fidelity: 100.0%
Book 4: 126,200 source words, 126,200 matched. Fidelity: 100.0%

Missing words: 0

Zero. Every word meant to be narrated is present in the generated audio.

Why the Reconciliation Layer Matters

You might ask: if the per-segment QA loop works, and the splitter is fixed, why do you still need the book-level reconciliation?

Because pipelines rot. The splitter that handles today's manuscript format will encounter a new format next month. The TTS model will be updated, and the update might handle boundaries differently. A worker process will die at the wrong moment and a segment will be marked complete without generating audio.

The reconciliation layer is the safety net that assumes everything upstream is imperfect. It does not trust the splitter, the queue, the workers, or the QA loop. It takes the final output and the original source and checks that they match. When they do not, it tells you exactly what is missing and where.

Defense in depth is not paranoia. It is the acknowledgment that any single check — no matter how well designed — will miss something. The reconciliation layer exists because the question "is each segment good?" is a different question from "is the whole book complete?"

The per-segment loop catches quality. The reconciliation layer catches completeness. Together they answer the only question that matters: did every word in the book make it into the audio?

Build both. Run both. Trust neither alone.
