Process Pools and JSON IPC: Multi-Worker TTS in Rust

The narrator-tts pipeline processes full novels — 120,000 words, 14 hours of audio, roughly 800 segments. Each segment needs to be sent to a TTS model running on an NVIDIA RTX 4070 Ti. The model loads weights into VRAM, runs inference, produces a waveform, and returns it for stitching.

The question was: how do you parallelize this?

The obvious answer in Rust is threads. Spawn a thread per segment, share a reference to the model, and let the scheduler sort it out. That is the answer most Rust developers would reach for, and it is wrong for GPU-bound work. Here is what I did instead.

Why Threads Are Wrong for GPU Work

CPU-bound parallelism is about using multiple cores. You spawn eight threads on an eight-core machine, and you get close to 8x throughput. The OS scheduler manages time slices, context switches are cheap, and mutexes keep shared state safe.

GPU-bound work breaks every one of those assumptions.

The GPU is a single resource. Threads inside one process normally share a single CUDA context on the RTX 4070 Ti. When two of them try to run inference on the same model simultaneously, they are not parallelizing anything: their work is serialized at the CUDA driver level. You have added thread synchronization overhead on top of work that was already sequential.

VRAM is not managed by the OS scheduler. A thread that loads a model into VRAM allocates GPU memory. If you spawn four threads and each one loads a model, you need 4x the VRAM. On a 12 GB card, that is a non-starter. The OS cannot page GPU memory to disk. There is no swap. You either fit or you crash with an out-of-memory error.

CUDA contexts do not compose. If a thread panics while holding a CUDA context, the context can be left in an invalid state. Every other thread using that context is then running on undefined ground. The error might not surface until the next inference call, hundreds of lines away from the actual failure. Debugging this is how you lose a weekend.

Threads give you the illusion of parallelism on hardware that does not support it. The GPU processes one inference at a time. Your threading model should reflect that reality, not fight it.

The Worker Process Model

Instead of threads, narrator-tts uses a pool of worker processes. The architecture has three pieces:

The orchestrator (narrator-tts) is the main process. It reads the manuscript, splits it into segments, and manages the work queue. It does not touch the GPU directly.

The workers (qbench-worker) are separate processes. Each worker loads the TTS model into VRAM, waits for text on stdin, runs inference, and writes the result to stdout. One model instance per worker, one worker at a time on this hardware.

The IPC layer is JSON over stdin/stdout. The orchestrator writes a JSON message to the worker's stdin. The worker reads it, processes it, and writes a JSON response to stdout. Stderr is reserved for logs.

// Request from orchestrator to worker
{
  "type": "generate",
  "segment_id": "ch03_seg014",
  "text": "The corridor stretched into darkness.",
  "voice": "narrator_default",
  "output_path": "/tmp/audio/ch03_seg014.wav"
}

// Response from worker to orchestrator
{
  "type": "result",
  "segment_id": "ch03_seg014",
  "status": "ok",
  "duration_ms": 41200,
  "samples": 998400
}

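That is the core of the protocol: a JSON line in, a JSON line out. The examples above are pretty-printed for readability; on the wire each message is a single line. No sockets, no shared memory, no FFI boundary to manage.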
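
As a rough sketch of what "one JSON line" means in practice, the request above can be built with nothing but the standard library. The function names here (json_escape, generate_request) are hypothetical and are not taken from narrator-tts; a real project would more likely use a crate such as serde_json. The important part is that newlines inside the text get escaped, so one message never spills onto a second line.

```rust
/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one "generate" request as a single line of JSON (no trailing newline).
/// Field names follow the example message in the text.
fn generate_request(segment_id: &str, text: &str, voice: &str, output_path: &str) -> String {
    format!(
        r#"{{"type":"generate","segment_id":"{}","text":"{}","voice":"{}","output_path":"{}"}}"#,
        json_escape(segment_id),
        json_escape(text),
        json_escape(voice),
        json_escape(output_path),
    )
}
```

The caller appends the newline when writing to the worker's stdin, which keeps framing in one place.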

Why stdin/stdout Instead of Sockets

I considered several IPC mechanisms. Named pipes, Unix domain sockets, TCP on localhost, shared memory segments. They all work. They all add complexity that JSON-over-stdio does not have.

No connection management. The orchestrator spawns the worker process and immediately has a pipe to it. No handshake, no retry logic, no port allocation. The worker is alive when the process exists and dead when it does not.

No serialization framework. The messages are JSON. Every language can produce and consume JSON. If I need to write a worker in Python tomorrow, the IPC layer does not change. If I need to debug a message, I can log it to a file and tail -f that.

Natural backpressure. If the orchestrator writes faster than the worker can read, the pipe buffer fills up, and the write blocks. If the worker writes faster than the orchestrator can read, the same thing happens in reverse. The OS handles flow control. Because each worker has at most one request in flight, the two sides never block on a full pipe at the same time, so there are no deadlocks and no dropped messages.

The simplest IPC mechanism that works is the best one. stdin/stdout with line-delimited JSON is debuggable with cat, testable without infrastructure, and portable across every platform Rust runs on.
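
To make the "spawn and you immediately have a pipe" point concrete, here is a minimal sketch using std::process::Command. The Worker struct and its request method are illustrative names, not the actual narrator-tts code, and the sketch assumes the strict one-request, one-response exchange described above.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// A spawned worker plus the two pipe ends the orchestrator talks through.
struct Worker {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

/// Spawn `program` with piped stdin/stdout. Stderr is inherited so that
/// worker logs stay visible and never mix with protocol messages.
fn spawn_worker(program: &str, args: &[&str]) -> io::Result<Worker> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::inherit())
        .spawn()?;
    let stdin = child.stdin.take().expect("stdin was piped");
    let stdout = BufReader::new(child.stdout.take().expect("stdout was piped"));
    Ok(Worker { child, stdin, stdout })
}

impl Worker {
    /// Send one JSON line, then block until one line comes back.
    fn request(&mut self, line: &str) -> io::Result<String> {
        writeln!(self.stdin, "{}", line)?;
        self.stdin.flush()?;
        let mut reply = String::new();
        if self.stdout.read_line(&mut reply)? == 0 {
            // EOF on stdout means the worker went away.
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "worker closed stdout"));
        }
        Ok(reply.trim_end().to_string())
    }
}
```

On a Unix-like system, spawn_worker("cat", &[]) gives an echo worker that is handy for testing the plumbing without a GPU: whatever line goes in comes straight back.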

Why Process Isolation Matters

The strongest argument for processes over threads is not performance. It is fault isolation.

A worker process that crashes does not take down the orchestrator. The orchestrator sees that the worker exited with a non-zero status, logs the error, spawns a replacement, and requeues the segment that was being processed. The pipeline keeps running.
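
A small sketch of that recovery step, under the assumption that the orchestrator tracks which segment was in flight. WorkerExit, classify, and reap_worker are hypothetical names chosen for illustration; spawning the replacement is left out.

```rust
use std::collections::VecDeque;
use std::io;
use std::process::{Command, ExitStatus};

#[derive(Debug, PartialEq)]
enum WorkerExit {
    Clean,
    /// Exit code if there was one; None when the process was killed by a signal (Unix).
    Crashed(Option<i32>),
}

fn classify(status: ExitStatus) -> WorkerExit {
    if status.success() {
        WorkerExit::Clean
    } else {
        WorkerExit::Crashed(status.code())
    }
}

/// Run a worker command to completion. If it crashed, put the segment it was
/// working on back at the front of the queue so it is retried next.
fn reap_worker(
    mut cmd: Command,
    in_flight: Option<String>,
    queue: &mut VecDeque<String>,
) -> io::Result<WorkerExit> {
    let exit = classify(cmd.status()?);
    if let (WorkerExit::Crashed(_), Some(segment)) = (&exit, in_flight) {
        queue.push_front(segment);
    }
    Ok(exit)
}
```

Nothing in the orchestrator's own memory depends on the worker, so the only cleanup is the queue update.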

With threads, a panic in one thread can corrupt shared state in the process. Even with catch_unwind, the CUDA context may be in an invalid state. The only safe recovery is to restart the entire process — which means reloading all models, losing the work queue, and starting the segment over from scratch.

Process isolation gives you clean failure boundaries. Each worker owns its VRAM, its CUDA context, its model instance. When it dies, the OS reclaims those resources. There is no leaked state, no corrupted context, no half-initialized model. A fresh process starts clean.

This matters especially for long-running jobs. An 8-hour audiobook generation job will hit transient errors — CUDA driver timeouts, thermal throttling, model artifacts that cause NaN outputs. Each of these should cause one segment to retry, not bring down the pipeline.

The Practical Details

The orchestrator spawns workers using std::process::Command. Each worker runs as:

qbench-worker --model narrator_v2.dat --device cuda:0

The worker writes a {"type":"ready"} message to stdout when the model is loaded and it is ready to accept work. The orchestrator waits for this signal before sending the first segment. This handles the startup cost correctly — model loading takes 15-20 seconds, and no work is sent until the worker signals readiness.
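
The readiness wait can be sketched as a loop over the worker's stdout. This version is generic over BufRead so it can be tested without a real process; wait_for_ready is a hypothetical name, and the choice to skip any lines that arrive before the ready message is an assumption of this sketch.

```rust
use std::io::{self, BufRead};

/// Read lines until the worker announces readiness.
/// Returns Ok(false) if stdout closes first, meaning the worker died during startup.
fn wait_for_ready<R: BufRead>(worker_stdout: &mut R) -> io::Result<bool> {
    let mut line = String::new();
    loop {
        line.clear();
        if worker_stdout.read_line(&mut line)? == 0 {
            return Ok(false);
        }
        // Compare with whitespace removed, so {"type": "ready"} also matches.
        let compact: String = line.chars().filter(|c| !c.is_whitespace()).collect();
        if compact == r#"{"type":"ready"}"# {
            return Ok(true);
        }
    }
}
```

This blocks for as long as the model takes to load. A production orchestrator would probably want a timeout around it, which this sketch does not attempt.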

The orchestrator maintains a simple work queue: a list of segment IDs that need processing. When a worker finishes a segment, the orchestrator sends the next one from the queue. When the queue is empty, the orchestrator sends a {"type":"shutdown"} message, and the worker exits cleanly.
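
Here is one way that dispatch loop could look for a single worker. It is a sketch, not the real implementation: drain_queue is a made-up name, the queue is assumed to hold request lines that are already serialized, and the writer and reader are generic so the loop can be exercised with in-memory buffers.

```rust
use std::collections::VecDeque;
use std::io::{self, BufRead, Write};

/// Feed requests to one worker, one at a time, then ask it to shut down.
/// Returns the raw response lines in the order they arrived.
fn drain_queue<W: Write, R: BufRead>(
    queue: &mut VecDeque<String>,
    to_worker: &mut W,
    from_worker: &mut R,
) -> io::Result<Vec<String>> {
    let mut responses = Vec::new();
    while let Some(request) = queue.pop_front() {
        writeln!(to_worker, "{}", request)?;
        to_worker.flush()?;

        let mut line = String::new();
        if from_worker.read_line(&mut line)? == 0 {
            // The worker vanished mid-segment: keep the request for a retry.
            queue.push_front(request);
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "worker exited mid-job"));
        }
        responses.push(line.trim_end().to_string());
    }
    // Queue is empty: tell the worker to exit cleanly.
    writeln!(to_worker, r#"{{"type":"shutdown"}}"#)?;
    to_worker.flush()?;
    Ok(responses)
}
```

Because the next request is only written after the previous response has been read, there is never more than one message in flight per worker.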

Error handling is straightforward. If a worker returns an error status, the segment goes back on the queue. If the worker process dies entirely, the segment goes back on the queue and a new worker is spawned. After three consecutive failures for the same segment, it is flagged for manual review.
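
The retry policy fits in a few lines. In this sketch, run_with_retries and MAX_ATTEMPTS are illustrative names, and the process closure stands in for a full round trip to a worker (returning false for either an error status or a dead worker). A segment leaves the queue as soon as it succeeds, so every failure counted against it is consecutive.

```rust
use std::collections::{HashMap, VecDeque};

const MAX_ATTEMPTS: u32 = 3;

/// Run every segment through `process`, requeueing failures.
/// Returns (completed, flagged_for_manual_review).
fn run_with_retries<F>(segments: &[&str], mut process: F) -> (Vec<String>, Vec<String>)
where
    F: FnMut(&str) -> bool,
{
    let mut queue: VecDeque<String> = segments.iter().map(|s| s.to_string()).collect();
    let mut failures: HashMap<String, u32> = HashMap::new();
    let (mut done, mut flagged) = (Vec::new(), Vec::new());

    while let Some(segment) = queue.pop_front() {
        if process(&segment) {
            done.push(segment);
            continue;
        }
        let count = failures.entry(segment.clone()).or_insert(0);
        *count += 1;
        if *count >= MAX_ATTEMPTS {
            // Third failure: stop retrying and leave it for a human.
            flagged.push(segment);
        } else {
            queue.push_back(segment);
        }
    }
    (done, flagged)
}
```

Requeueing at the back means one bad segment does not hold up the rest of the book while it burns through its attempts.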

On this hardware — RTX 4070 Ti, one worker at a time — the throughput is about 20 segments per hour. That is GPU-bound. The IPC overhead, the JSON serialization, the process spawning — all of it combined adds less than 100ms per segment. The bottleneck is inference, which is exactly what you want. You want the GPU to be the limiting factor, not your coordination layer.
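
To put those numbers side by side, here is the arithmetic as a short sketch. It treats the 100ms figure as an upper bound on coordination cost per segment; the function names are made up for the example.

```rust
/// Share of each segment's wall time spent on coordination rather than inference.
fn overhead_fraction(segments_per_hour: f64, overhead_ms: f64) -> f64 {
    let seconds_per_segment = 3600.0 / segments_per_hour;
    (overhead_ms / 1000.0) / seconds_per_segment
}

/// Total coordination time, in seconds, across a whole job.
fn total_overhead_seconds(segments: u32, overhead_ms: f64) -> f64 {
    segments as f64 * overhead_ms / 1000.0
}
```

At 20 segments per hour each segment takes about 180 seconds, so 100ms of overhead is roughly 0.06% of the wall time, and across 800 segments it adds up to at most 80 seconds.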

The General Principle

The pattern is: match your concurrency model to your hardware reality.

A GPU is not a CPU. It does not have cores that you can independently schedule work onto from application code. From inside one process it has a single context, bounded memory, and a driver that serializes access. Your architecture should reflect that.

When the shared resource is a GPU:

  • Use processes, not threads. Each process owns its CUDA context.
  • Use simple IPC. JSON over stdin/stdout is enough.
  • Design for clean failure and restart. A dead worker should be replaceable without state corruption.
  • Accept that real parallelism comes from the pipeline, not from concurrent GPU access. Overlap CPU work (text splitting, audio stitching, ASR verification) with GPU work (inference).
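
The last point, overlapping CPU stages with the GPU stage, can be sketched with bounded channels from the standard library. Everything here is a stand-in: run_pipeline is a hypothetical name, splitting on periods is a toy substitute for real segmentation, and the middle stage fakes inference with a string where the real pipeline would do a round trip to the worker process.

```rust
use std::sync::mpsc;
use std::thread;

/// Three stages: split (CPU) -> infer (stand-in for the GPU worker) -> stitch (CPU).
/// Each stage runs while the others are busy; the middle stage handles one segment at a time.
fn run_pipeline(manuscript: &str) -> Vec<String> {
    // Bounded channels: a fast stage blocks instead of buffering without limit.
    let (seg_tx, seg_rx) = mpsc::sync_channel::<String>(4);
    let (audio_tx, audio_rx) = mpsc::sync_channel::<String>(4);

    let text = manuscript.to_string();
    let splitter = thread::spawn(move || {
        for sentence in text.split('.').map(str::trim).filter(|s| !s.is_empty()) {
            if seg_tx.send(sentence.to_string()).is_err() {
                break; // downstream stage is gone
            }
        }
        // seg_tx is dropped here, which ends the inference loop below.
    });

    let inference = thread::spawn(move || {
        for segment in seg_rx {
            // Placeholder for sending the segment to the worker and awaiting its result.
            if audio_tx.send(format!("audio({})", segment)).is_err() {
                break;
            }
        }
    });

    // "Stitching" on the calling thread: collect results in arrival order.
    let stitched: Vec<String> = audio_rx.iter().collect();
    splitter.join().expect("splitter panicked");
    inference.join().expect("inference stage panicked");
    stitched
}
```

Threads are fine here because none of them touch the GPU. They only keep the CPU-side stages busy while the single worker does inference.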

The GPU does one thing at a time. Build your system around that fact instead of pretending you can change it.

This is not a clever trick. It is the boring, correct architecture for GPU-bound work in Rust. Process pool, JSON pipes, clean restarts. It has run 800-segment jobs without intervention, and when a worker dies, the pipeline replaces it and carries on.