128GB Local AI Rig Changes the Game, but Multi-GPU Correctness Is the Real Story

What changed

A local inference setup moved from a single 96GB RTX Pro 6000 to a heterogeneous 128GB pool by adding a 32GB RTX 5090 over Thunderbolt 4. That 33% jump in aggregate VRAM directly changes which model and quantization combinations can load at all, and the threshold shift matters most for workloads that were failing on memory allocation, not merely running slowly.
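The fit math behind that threshold is simple enough to sanity-check before downloading anything. A minimal sketch, assuming illustrative bytes-per-weight figures for common GGUF quantizations and a fixed overhead budget; the exact numbers vary by model, quant, and runtime:

```python
# Rough VRAM fit check for quantized weights plus runtime overhead.
# All figures below are illustrative assumptions, not measured values.

BYTES_PER_PARAM = {
    "Q4_K_M": 0.56,  # roughly 4.5 bits/weight on average (assumed)
    "Q6_K":   0.82,  # roughly 6.6 bits/weight (assumed)
    "Q8_0":   1.06,  # roughly 8.5 bits/weight (assumed)
    "F16":    2.00,
}

def weights_gb(params_b: float, quant: str) -> float:
    """Approximate weight footprint in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[quant]

def fits(params_b: float, quant: str, pool_gb: int, overhead_gb: float = 8.0) -> bool:
    """True if weights plus a flat KV/activation overhead budget fit in the pool."""
    return weights_gb(params_b, quant) + overhead_gb <= pool_gb

# A hypothetical 123B-parameter model at Q6_K: out of reach at 96GB,
# loadable once the pool reaches 128GB.
print(fits(123, "Q6_K", 96))   # False
print(fits(123, "Q6_K", 128))  # True
```

The interesting band is exactly the one the upgrade opens: models whose weights-plus-overhead land between the old 96GB ceiling and the new 128GB one.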

A second change is behavioral, not hardware: llama.cpp reportedly produced random-token output with Qwen 3.5 when running multi-GPU under default -sm layer, while -sm row and single-GPU runs produced coherent output. That creates a clear A/B signal tied to sharding strategy. In other words, the same model family can move from unusable to sane output based purely on split mode.

Why it matters

For coding-heavy local use, memory ceilings often block progress before raw throughput does. Crossing from 96GB to 128GB can unlock larger parameter footprints, less aggressive quantization, or larger KV/context allocations that were previously out of reach. If your target model sits near the old boundary, this upgrade can be the difference between “won’t load” and “usable daily driver.”
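The KV/context side of that tradeoff is also easy to estimate. A sketch using the standard per-token KV-cache formula, with an assumed GQA shape loosely resembling a large dense model (80 layers, 8 KV heads, head dimension 128); real models differ:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Size of the KV cache in GiB: K and V tensors per layer, per token,
    at fp16 (2 bytes per element) by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len / 1024**3

# Assumed shape: 80 layers, 8 KV heads, head_dim 128.
print(round(kv_cache_gb(80, 8, 128, 32_768), 1))   # 10.0 GiB at 32K context
print(round(kv_cache_gb(80, 8, 128, 131_072), 1))  # 40.0 GiB at 128K context
```

Under these assumptions, the 32GB the 5090 adds is the difference between a cramped context window and a genuinely long one on a model that already nearly fills the original card.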

But reliability beats capacity. If a multi-GPU partition path can corrupt generation, added VRAM does not translate to trustworthy coding output. The technical takeaway is specific: split mode is a first-order variable for output validity in at least one observed Qwen 3.5 configuration, so performance benchmarking alone is not enough.

What to do next

Treat this as a fit-and-correctness campaign. First, test models and quants that are just above the 96GB envelope and below 128GB, because that range is where the upgrade should produce immediate new options. Second, run deterministic prompt sets focused on coding tasks and token stability, so you can compare exact behavior across split modes and GPU layouts.
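The comparison step can be mostly automated. A minimal harness sketch for flagging the kind of random-token corruption described above: run the same deterministic prompt (temperature 0, fixed seed) under each split mode, capture the text, and compare against the known-good run. The `agreement`/`classify` helpers and the 0.9 threshold are assumptions for illustration, not part of llama.cpp:

```python
from difflib import SequenceMatcher

def agreement(text_a: str, text_b: str) -> float:
    """Similarity ratio between two generations (1.0 = identical)."""
    return SequenceMatcher(None, text_a, text_b).ratio()

def classify(baseline: str, candidate: str, threshold: float = 0.9) -> str:
    """Flag a run whose output diverges sharply from the known-good baseline.
    With greedy decoding the two runs should be near-identical; random-token
    corruption drives the ratio toward zero."""
    return "ok" if agreement(baseline, candidate) >= threshold else "suspect"

# Usage sketch: capture output from a -sm row run (baseline) and a
# -sm layer run (candidate) on the same prompt, then compare.
baseline = "def add(a, b):\n    return a + b\n"
corrupted = "qZ# ~~ 0x41 ')(' betwixt lemur"
print(classify(baseline, baseline))   # "ok"
print(classify(baseline, corrupted))  # "suspect"
```

Exact-match checks are too brittle across hardware layouts, since small numeric differences can legitimately change a token or two; a similarity threshold catches wholesale corruption without false-alarming on that.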

Keep -sm row or single-GPU as your known-good baseline for Qwen 3.5 until -sm layer behavior is explained or fixed in newer llama.cpp builds. Teams that build local coding assistants, solo developers optimizing private workflows, and labs validating offline inference stacks benefit most from this approach. The win is not “more VRAM equals better output.” The win is proving correctness first, then locking in the biggest model that remains stable.

Source: Reddit discussion on 96GB to 128GB VRAM for local LLMs