The Audiobook Pipeline That Replaced a $900 Subscription
In March I wrote about building a failure-tolerant TTS narration pipeline. The core insight then was that long-form TTS is a systems problem, not a model problem.
A lot has changed since then.
The pipeline can now produce multi-voice audiobooks with character-specific voices, automatic rewriting, quality checks, and recovery from crashes. It has a GUI. It runs on consumer hardware. And it costs nothing per book beyond the electricity.
This is what that journey looks like six months in.
From Python to Rust
The original pipeline was Python. It worked for prototyping but struggled with reliability on long runs. Memory leaks, threading issues, and the general fragility of glue code started adding up.
The rewrite in Rust was the single best decision I made. Not because Rust is inherently better for TTS — the model does not care what language calls it. But Rust's type system and ownership model eliminated entire categories of bugs that were eating my time in Python.
The workspace now has three crates:
narrator-tts is the core pipeline. It handles text preprocessing, segmentation, voice assignment, generation, quality checks, and audio assembly. It uses the tch crate for PyTorch/libtorch bindings, which gives direct access to CUDA without the overhead of a Python runtime.
narrator-gui is the operator interface. Settings management, process spawning, real-time progress monitoring. It talks to the CLI backend through a clean manifest and progress contract.
narrator-slideshow is early days, but the vision is an integrated tool for authors to create narrated slideshow content from their manuscripts.
The separation matters. The CLI is the canonical backend. The GUI is an operator interface. The manifest is the contract between them. Workers are the scaling layer.
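The post does not show the actual manifest or progress format, so the following is only a sketch of what a progress contract could look like. It assumes a hypothetical line-oriented protocol in which the CLI prints lines such as PROGRESS 12/340, DONE, or ERROR <message> to stdout. The event names and line prefixes are illustrative, not the real narrator-tts output.

```rust
/// Hypothetical progress event. The real contract is not shown in this post.
#[derive(Debug, PartialEq)]
pub enum ProgressEvent {
    /// `done` segments finished out of `total`.
    Segment { done: u32, total: u32 },
    /// The backend reported a clean finish.
    Finished,
    /// The backend reported a failure with a message.
    Failed(String),
}

/// Parse one line of backend output into an event.
/// Lines that do not match the assumed protocol (log noise, blank lines) yield `None`.
pub fn parse_progress_line(line: &str) -> Option<ProgressEvent> {
    let line = line.trim();
    if line == "DONE" {
        return Some(ProgressEvent::Finished);
    }
    if let Some(msg) = line.strip_prefix("ERROR ") {
        return Some(ProgressEvent::Failed(msg.to_string()));
    }
    // Expect "PROGRESS <done>/<total>"; any parse failure falls through to None.
    let rest = line.strip_prefix("PROGRESS ")?;
    let (done, total) = rest.split_once('/')?;
    Some(ProgressEvent::Segment {
        done: done.trim().parse().ok()?,
        total: total.trim().parse().ok()?,
    })
}
```

The point of a contract this narrow is that the GUI never needs to know how generation works. It only needs to understand a handful of event shapes, and anything it does not recognise can be ignored or logged.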
Voice Cloning: The Core Feature
Here is something that took me too long to understand.
The real feature is not synthetic voice generation. The real feature is voice cloning from reference audio.
Synthetic voices — where you describe a voice and the model designs one for you — sound decent. They are useful as a fallback when you do not have a reference. But they lack the character and texture that makes a voice feel like a specific person.
Voice cloning from reference audio is where the magic happens. You give the model a few seconds of someone speaking, and it produces narration in that voice. The results are dramatically better than synthetic generation.
This insight reshaped the entire product vision. The narrator pipeline is not trying to replace voice actors. It is trying to extend them.
A voice actor records a few minutes of reference audio. The pipeline produces audiobook chapters in their voice. The voice actor earns a commission every time their voice is used. The author gets a professional-quality narration without paying studio rates.
The short-term reality is simpler — I am using my own voice and my friends' voices as reference audio for personal projects. But the business architecture is designed around the idea that voice actors are the customers, not the casualties.
The Rewriter Problem
One of the hardest problems was not the TTS at all. It was the text.
Published books are written for readers. Audiobooks need to be written for speakers. The difference is subtle but pervasive.
Curly quotes need to become straight quotes. Section breaks need to become pauses. Labels like "DRINK ME" need to be treated as text, not dialogue. Attributions need to be handled differently. Abbreviations need expansion.
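Two of these transformations are mechanical enough to sketch without an LLM. The functions below are a minimal illustration, not the pipeline's actual preprocessing code. The abbreviation table is a small hypothetical sample, and a real one would need to handle ambiguous cases such as "St." (Saint or Street).

```rust
/// Replace typographic (curly) quotes with their ASCII equivalents.
pub fn straighten_quotes(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{201C}' | '\u{201D}' => '"',  // left and right double quotes
            '\u{2018}' | '\u{2019}' => '\'', // left and right single quotes
            other => other,
        })
        .collect()
}

/// Expand a few unambiguous abbreviations, matching whole space-separated tokens.
/// The table here is a hypothetical sample, not an exhaustive list.
pub fn expand_abbreviations(text: &str) -> String {
    text.split(' ')
        .map(|word| match word {
            "Mr." => "Mister",
            "Mrs." => "Missus",
            "Dr." => "Doctor",
            other => other,
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Running deterministic passes like these before the LLM stage keeps the model's job smaller, since it never sees the variations they remove.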
I built a three-stage rewriting pipeline using a local LLM (Gemma) running through Ollama. It preprocesses the manuscript, runs an LLM rewrite pass for voice tagging and attribution, then post-processes to catch the edge cases the LLM misses.
The temperature sweet spot was 0.3. Lower than that and it over-splits dialogue. Higher than that and it starts making errors. LLM non-determinism means the same input can produce different character attributions across runs, so the post-processor needs to be robust.
After months of iteration, the word fidelity rate is around 98%. Nearly perfect, with the remaining 2% being acceptable creative adaptations.
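The post does not define how word fidelity is measured. One plausible definition, used here purely as an assumption, is the fraction of source words that survive in the rewrite in their original order: the length of the longest common subsequence of normalised words, divided by the source word count. Under that definition, voice tags added by the rewriter do not lower the score, while dropped or replaced words do.

```rust
/// Lowercase each word and strip punctuation so "Alice," and "alice" compare equal.
fn normalize_words(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| {
            w.chars()
                .filter(|c| c.is_alphanumeric())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

/// Assumed metric: longest common subsequence of words / number of source words.
/// Returns 1.0 for an empty source, since nothing could have been lost.
pub fn word_fidelity(source: &str, rewritten: &str) -> f64 {
    let a = normalize_words(source);
    let b = normalize_words(rewritten);
    if a.is_empty() {
        return 1.0;
    }
    // Standard LCS dynamic programme, keeping only the previous row.
    let mut prev = vec![0usize; b.len() + 1];
    for wa in &a {
        let mut curr = vec![0usize; b.len() + 1];
        for (j, wb) in b.iter().enumerate() {
            curr[j + 1] = if wa == wb {
                prev[j] + 1
            } else {
                prev[j + 1].max(curr[j])
            };
        }
        prev = curr;
    }
    prev[b.len()] as f64 / a.len() as f64
}
```

This runs in time proportional to the product of the two word counts, so it is best applied per chapter or per paragraph rather than to a whole manuscript at once.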
The Architecture Lesson
The biggest architectural lesson from this project is one I keep relearning.
The CLI is the canonical backend. The GUI is an operator interface.
When I started, I tried to build the GUI and the backend together. Every feature needed GUI changes and backend changes simultaneously. Testing was hard because I had to interact with the GUI to verify anything.
Once I separated them, everything accelerated. The CLI became the source of truth. Features got built and tested through the CLI first. The GUI was just a thin layer that spawned the CLI process and monitored its output.
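The "thin layer" pattern can be sketched with the standard library alone. This is an illustration of the general approach, not the narrator-gui code: the program name and arguments are whatever the caller passes in, and nothing here reflects the real CLI's flags. The line-reading loop is split out from the process spawning so it can be tested against an in-memory buffer.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Forward each line from `reader` to `on_line`; returns how many lines were read.
pub fn monitor_lines<R: BufRead>(
    reader: R,
    mut on_line: impl FnMut(&str),
) -> std::io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        on_line(line.as_str());
        count += 1;
    }
    Ok(count)
}

/// Spawn a backend process, stream its stdout line by line, and report
/// whether it exited successfully. `program` and `args` are caller-supplied.
pub fn run_backend(
    program: &str,
    args: &[&str],
    on_line: impl FnMut(&str),
) -> std::io::Result<bool> {
    let mut child = Command::new(program)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;
    // stdout is always present here because it was configured as piped above.
    let stdout = child.stdout.take().expect("stdout was configured as piped");
    monitor_lines(BufReader::new(stdout), on_line)?;
    Ok(child.wait()?.success())
}
```

Because the frontend only consumes text lines and an exit status, every feature can be exercised from a terminal or a test harness before any GUI code exists.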
This is not a new insight. It is conventional wisdom in software architecture. But it is the kind of insight that you have to experience the pain of ignoring before it really sticks.
Where It Stands Now
The system has roughly 600 tests passing across narrator-tts and narrator-gui. It can generate multi-voice audiobooks with per-character voice assignment. It has a recovery system that detects and restarts failed jobs. It has a GUI with VRAM management controls and Project Gutenberg integration for one-click book downloads.
The RTX 4070 Ti generates audio at about 14x real-time, meaning a 10-hour audiobook takes roughly 43 minutes of GPU time (600 minutes divided by 14). Not bad for a 12GB consumer card.
The next big milestones are per-segment resume (so a crash at segment 47 does not redo segments 1-46), a real-time progress dashboard in the GUI, and the multi-voice feature reaching full production quality.
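Per-segment resume is not built yet, so the following is only a sketch of one way it could work. It assumes a hypothetical checkpoint log containing one completed segment number per line, appended as each segment finishes. On restart, the pipeline would read the log and generate only what is missing.

```rust
use std::collections::BTreeSet;

/// Parse a hypothetical checkpoint log: one completed segment number per line.
/// Blank or unparsable lines are skipped rather than treated as errors.
pub fn parse_checkpoint(log: &str) -> BTreeSet<usize> {
    log.lines()
        .filter_map(|line| line.trim().parse().ok())
        .collect()
}

/// Segments (numbered from 1) that still need generation, in ascending order.
pub fn pending_segments(total: usize, completed: &BTreeSet<usize>) -> Vec<usize> {
    (1..=total).filter(|i| !completed.contains(i)).collect()
}
```

With segments 1 to 46 recorded in the log, a restart would begin at segment 47. Tracking a set rather than a single "last finished" number also covers the case where workers complete segments out of order.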
What Six Months Taught Me
The lesson I keep learning from every AI project is the same.
The model is never the hard part.
The hard part is everything around the model. The preprocessing. The post-processing. The error handling. The recovery. The testing. The user interface. The architecture that separates concerns cleanly enough that you can iterate without breaking things.
TTS models are good enough for production today. They have been for a while. What is not good enough is the infrastructure around them. The tooling that turns a research model into something an author can actually use to produce an audiobook.
That is the gap this project fills. And it is the gap that most AI projects fill — the unglamorous engineering work that happens between the model and the user.
It is not exciting. But it is the difference between a demo and a product.