Squeezing 11.7 GB Into 12.3 GB: VRAM Management for Local AI Workloads

Everybody talks about model quality. Nobody talks about VRAM. But when you are running local AI — actually running it, not calling an API — VRAM is the constraint that determines everything. Which model you can load. How many workers you can spawn. Whether your pipeline runs or crashes at 2 AM with an out-of-memory error.

My RTX 4070 Ti has 12.3 GB of VRAM. The TTS model I use — a 1.7B parameter voice cloning model — needs about 5.2 GB just to sit in memory. Add inference buffers, KV cache, audio processing overhead, and the CUDA context itself, and you are at 9 GB before you have generated a single sample of audio.

That leaves roughly 3 GB for everything else. And everything else includes the worker processes, the orchestrator, the OS compositor, and whatever else the GPU is doing.

This is the math of local AI. It is not glamorous. But getting it wrong means nothing works.

The Problem: Silent OOM

The original failure mode was simple. A worker process would load the model, start generating audio, and at some unpredictable point — usually 40 minutes into an 8-hour batch run — the GPU would run out of memory. CUDA would throw an allocation error. The worker would crash. The orchestrator would wait for a heartbeat that never came. The entire pipeline would stall.

Sometimes the crash happened on segment 3. Sometimes on segment 300. There was no pattern because VRAM fragmentation is nondeterministic. CUDA allocates and frees memory in blocks, and over time, the free blocks get smaller and scattered. Eventually, a request for a contiguous block fails even though the total free memory would technically be enough.

Fragmentation is the hidden tax of long-running GPU processes. You can have 2 GB free and still fail to allocate 500 MB if that 2 GB is scattered across a hundred tiny fragments.
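A toy illustration of that last point, in Python. This is not how the CUDA allocator works internally; it is a minimal sketch that treats free memory as a list of block sizes, with the 100 equal fragments chosen purely for illustration. It shows the gap between "total free" and "largest contiguous free".

```python
# Toy model of a fragmented heap. The block layout is hypothetical:
# 2 GB free in total, split into 100 equal 20 MB fragments.
MB = 10**6

def can_allocate(free_blocks, request_bytes):
    """A request succeeds only if one single free block is large enough."""
    return any(block >= request_bytes for block in free_blocks)

free_blocks = [20 * MB] * 100

total_free = sum(free_blocks)   # 2000 MB in total
largest_block = max(free_blocks)  # but no block is larger than 20 MB

print(f"total free:     {total_free // MB} MB")
print(f"largest block:  {largest_block // MB} MB")
print(f"500 MB request: {'ok' if can_allocate(free_blocks, 500 * MB) else 'fails'}")
```

The total is four times the request, and the request still fails, because no single block can hold it.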

The Auto-Reserve System

The first fix was to stop pretending I could use all 12.3 GB. I set a hard reserve.

The system works like this. Before loading the model, the worker calls cudaMemGetInfo to check available VRAM. It then applies a fixed safety margin: 800 MB reserved for CUDA context, OS overhead, and fragmentation headroom. If loading the model would leave less free VRAM than that margin, the worker refuses to start.
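A minimal sketch of that check, in Python. The function name and the way free memory is passed in are assumptions for illustration, not the pipeline's actual code. In a PyTorch worker the free-byte figure could come from torch.cuda.mem_get_info(), which wraps cudaMemGetInfo and returns (free, total) in bytes; here it is a plain argument so the logic runs without a GPU.

```python
MB = 10**6

RESERVE_BYTES = 800 * MB  # the safety margin described in the text

def safe_to_load(free_bytes, model_bytes, reserve_bytes=RESERVE_BYTES):
    """Return True if loading the model still leaves the reserve free.

    free_bytes is whatever the driver reports as free right now
    (hypothetically: free_bytes, _total = torch.cuda.mem_get_info()).
    """
    return free_bytes - model_bytes >= reserve_bytes

# Example figures, using the 5.2 GB model from the text.
print(safe_to_load(free_bytes=11_500 * MB, model_bytes=5_200 * MB))  # True
print(safe_to_load(free_bytes=5_500 * MB, model_bytes=5_200 * MB))   # False
```

The second call is the interesting one: the model technically fits in 5.5 GB free, but only 300 MB would remain, so the worker declines to start.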

This sounds obvious. It was not obvious when I built it, because the original code just loaded the model and hoped for the best. The model fit, so it ran. The problem was that "it fits" and "it fits with enough headroom to run reliably for 8 hours" are different statements.

The question is never "can I load this model?" The question is "can I load this model and still handle the worst-case memory spike during inference?"

Inference memory is not constant. Some text segments produce longer audio, which means larger intermediate buffers. Some phoneme combinations cause the attention mechanism to allocate more KV cache. You need headroom for those spikes, or you will OOM on the one segment that happens to be longer than the rest.

The auto-reserve system made loading deterministic. If the worker cannot start safely, it does not start. The orchestrator logs the failure, waits for other workers to finish, and tries again when VRAM frees up. No silent crashes at 2 AM.
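The orchestrator side of this can be sketched as a simple poll loop. Everything here is an assumption for illustration: the function name, the 30-second poll interval, the attempt limit, and the injected get_free_bytes callable are not taken from the real orchestrator.

```python
import time

def wait_until_safe(get_free_bytes, required_bytes, poll_seconds=30.0,
                    max_attempts=10, sleep=time.sleep, log=print):
    """Poll free VRAM until a worker can start safely, or give up.

    get_free_bytes: callable returning current free VRAM in bytes.
    required_bytes: model footprint plus the safety reserve.
    Returns True when it is safe to spawn, False if attempts ran out.
    """
    for attempt in range(1, max_attempts + 1):
        free = get_free_bytes()
        if free >= required_bytes:
            return True
        # Log the refusal instead of crashing, then try again later.
        log(f"attempt {attempt}: {free} bytes free, need {required_bytes}; waiting")
        sleep(poll_seconds)
    return False
```

Passing sleep and log as parameters keeps the loop easy to exercise without a GPU or real waiting.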

The flush-vram Flag

The second fix addressed fragmentation directly. Between batches — every 50 segments — the worker runs a flush cycle.

The flush is blunt. It saves the model state to CPU memory, destroys all CUDA allocations, recreates the CUDA context, and reloads the model. This takes about 4 seconds. It is the GPU equivalent of restarting your browser when it gets sluggish.

Four seconds every 50 segments is not free. Across a 600-segment book, that is 48 seconds of overhead. But the alternative is a fragmented VRAM crash that kills the pipeline and requires manual intervention. 48 seconds of prevention versus a 2 AM debugging session. The math is not hard.
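The arithmetic, as a small Python helper. The function name is made up for this post; the defaults are the figures quoted above.

```python
def flush_overhead_seconds(total_segments, flush_every=50, flush_seconds=4.0):
    """Total time spent flushing, with one flush per completed batch."""
    flushes = total_segments // flush_every
    return flushes * flush_seconds

print(flush_overhead_seconds(600))  # 12 flushes * 4 s = 48.0
```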

The flush is triggered by a flush-vram flag in the worker configuration. It can also be triggered manually — if I notice VRAM creeping up in nvidia-smi, I can send a signal to the worker and it will flush at the next segment boundary.
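One way the two triggers could be combined, as a hedged Python sketch. The class and method names are hypothetical, and the flush itself is left out; this only decides when a flush is due. A manual request is stored as a flag and honored at the next segment boundary, never mid-segment.

```python
class FlushScheduler:
    """Decides at each segment boundary whether a VRAM flush is due."""

    def __init__(self, flush_every=50):
        self.flush_every = flush_every
        self.segments_since_flush = 0
        self._manual_requested = False

    def request_flush(self, *_args):
        """Ask for a flush at the next segment boundary.

        Extra arguments are accepted and ignored so this method could be
        registered as a signal handler, which is called with (signum, frame).
        Assumed wiring on a Unix system, not executed here:
            signal.signal(signal.SIGUSR1, scheduler.request_flush)
        """
        self._manual_requested = True

    def segment_done(self):
        """Call after each segment; returns True if a flush is due now."""
        self.segments_since_flush += 1
        due = (self._manual_requested
               or self.segments_since_flush >= self.flush_every)
        if due:
            # Reset both triggers so the next batch starts counting from zero.
            self._manual_requested = False
            self.segments_since_flush = 0
        return due
```

The worker loop would call segment_done() after every segment and run the flush cycle whenever it returns True.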

I considered more sophisticated approaches. CUDA memory pools. Arena allocators. Custom allocation strategies that defragment in place. They all added complexity, and the flush approach — restart everything every 50 segments — worked. When a dumb solution solves the problem, use the dumb solution.

Why Worker Count Is a VRAM Decision

The narrator-tts pipeline spawns worker processes. Each worker loads its own copy of the model. This is deliberate — process isolation means that if one worker crashes, the others keep running. But it means each worker consumes its full VRAM footprint independently.

On my RTX 4070 Ti, one worker fits comfortably. Two fit only if each worker's footprint is cut well below that of the full 1.7B model, which by itself takes most of the card. Three is impossible.

Worker count is not a throughput decision. It is a VRAM decision. I would love to run three workers and triple my throughput. But the GPU does not care about my throughput desires. It cares about bytes.

The orchestrator checks VRAM before spawning a worker. If available VRAM is below the threshold, it does not spawn. The worker pool size is not a fixed number — it is a dynamic value determined by available GPU memory at spawn time.
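A sketch of that spawn loop in Python. The names, the threshold value, and the injected callables are assumptions for illustration; the real orchestrator may be structured differently. The key point is that free VRAM is re-read before every spawn rather than computed once.

```python
def spawn_workers(get_free_bytes, spawn, threshold_bytes, max_workers):
    """Spawn workers one at a time while free VRAM stays above the threshold.

    get_free_bytes: callable returning current free VRAM in bytes.
    spawn: callable that starts one worker (and so consumes VRAM).
    Returns the number of workers actually started.
    """
    spawned = 0
    while spawned < max_workers and get_free_bytes() >= threshold_bytes:
        spawn()
        spawned += 1
    return spawned

# Simulated GPU, in MB: 11,500 free, each worker takes 7,300
# (5,200 model + 2,100 buffers), threshold is 7,300 + 800 reserve.
gpu = {"free": 11_500}

def fake_spawn():
    gpu["free"] -= 7_300

print(spawn_workers(lambda: gpu["free"], fake_spawn,
                    threshold_bytes=8_100, max_workers=3))  # 1
```

With these figures the loop stops after one worker: 4,200 MB remain, which is below the threshold, so no second worker is started even though max_workers allows three.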

This means that if I have other GPU work happening — a display update, a video decode, anything — the orchestrator might spawn fewer workers. It adapts. Not because adaptation is elegant, but because hardcoding worker count caused crashes.

The Real Bottleneck

People ask me why I do not use a bigger GPU. The honest answer is that the 4070 Ti was what I had, and the constraint forced good engineering.

When VRAM is scarce, you build systems that respect it. You pre-allocate buffers. You free memory you do not need. You batch operations to reduce allocation churn. You measure VRAM usage and set guards. You build flush cycles. You make worker counts dynamic.

When VRAM is abundant, you skip all of that. You load the model, run inference, and it works. Until it does not — at scale, under load, on a different GPU, in production. The habits you build on a constrained GPU are the habits that make your code run anywhere.

Constraints produce better engineering than abundance. The flush-vram flag, the auto-reserve system, the dynamic worker pool — none of these exist because I wanted to build them. They exist because 12.3 GB was not enough to be lazy.

The Numbers

For anyone running similar hardware, here is what works on the RTX 4070 Ti with a 1.7B parameter model:

  • Model load: 5.2 GB
  • Inference buffers (per worker): ~2.1 GB
  • CUDA context + OS overhead: ~800 MB
  • Safety reserve: 800 MB
  • Total per worker: ~8.9 GB
  • Available for second worker: ~3.4 GB, not enough for another 5.2 GB model copy

So: one worker. Reliable, repeatable, predictable. The pipeline runs for hours, flushes every 50 segments, and does not crash.
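The budget above, written out as arithmetic in Python. The figures are the ones from the list; the assumption that a second worker needs at least its own model copy plus inference buffers (and nothing more) is mine, and it is the optimistic case.

```python
MB = 10**6
GB = 10**9

TOTAL_VRAM = 12_300 * MB

budget = {
    "model": 5_200 * MB,
    "inference_buffers": 2_100 * MB,
    "context_and_os": 800 * MB,
    "safety_reserve": 800 * MB,
}

first_worker = sum(budget.values())
remaining = TOTAL_VRAM - first_worker

# Optimistic assumption: the second worker only needs model + buffers.
second_worker_needs = budget["model"] + budget["inference_buffers"]
second_worker_fits = remaining >= second_worker_needs

print(f"first worker:        {first_worker / GB:.1f} GB")         # 8.9 GB
print(f"remaining:           {remaining / GB:.1f} GB")            # 3.4 GB
print(f"second worker needs: {second_worker_needs / GB:.1f} GB")  # 7.3 GB
print(f"second worker fits:  {second_worker_fits}")               # False
```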

Could I optimize the model? Use quantization, reduce buffer sizes, try a smaller checkpoint? Sure. But the goal was not to squeeze maximum performance from the hardware. The goal was to produce audiobooks without waking up to a crashed pipeline. VRAM management is how you get there.

VRAM is not a spec sheet number. It is a budget. Manage it like one.