ROCm 7.2 on Strix Halo Posts Eye-Opening Local LLM Benchmarks, but It’s a Validation Story, Not a Launch
What changed
A new community benchmark drop showed llama-bench running on ROCm 7.2 with AMD’s Ryzen AI Max+ 395 platform: Strix Halo with Radeon 8060S graphics (40 CUs) and 128 GB of unified memory, on Fedora with kernel 6.18.13. The test stack pinned llama.cpp to commit d417bc43, and most model variants were Unsloth UD quantizations, which matters because quant format alone can shift throughput substantially.
The headline numbers are concrete and span a wide range of model sizes. Qwen3.5-0.8B Q4 was reported at about 5,968 t/s prompt processing (pp512) and 175.8 t/s generation (tg128). Qwen3.5-35B-A3B Q4 came in around 887 t/s pp512 and 39.7 t/s tg128. Qwen3.5-122B-A10B Q4 still posted roughly 268 t/s pp512 and 21.3 t/s tg128, GLM-4.7-Flash Q4 hit about 916.6 t/s pp512 and 46.3 t/s tg128, and GPT-OSS-120B Q8 reported around 499.4 t/s pp512 and 42.1 t/s tg128.
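Those two columns map directly onto llama-bench flags: -p 512 drives the prompt-processing figure and -n 128 the generation figure. A minimal sketch of that kind of invocation, with a hypothetical binary location and model path (adjust both for your own setup):

```python
import json
import subprocess

# Hypothetical paths; point these at your own llama-bench build and model.
LLAMA_BENCH = "./build/bin/llama-bench"
MODEL = "/models/your-model-Q4_K_M.gguf"

# -p 512 produces the prompt-processing (pp512) figure, -n 128 the
# generation (tg128) figure; -ngl 99 offloads all layers to the GPU and
# -o json asks for machine-readable output instead of the usual table.
cmd = [LLAMA_BENCH, "-m", MODEL, "-p", "512", "-n", "128", "-ngl", "99", "-o", "json"]

result = subprocess.run(cmd, capture_output=True, text=True, check=True)
for run in json.loads(result.stdout):
    # Field names vary by llama.cpp version; print each record and read
    # off the average tokens/s for the pp and tg tests.
    print(run)
```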
Why it matters
These results suggest a meaningful shift for local AI builders: on this APU class, very large models can move from “possible” to “interactive enough” for real sessions, especially with MoE architectures where active parameters are far below total parameter count. That changes planning for indie developers, small studios, and creators who want serious local inference without immediately jumping to a discrete GPU stack.
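To make “interactive enough” concrete, it helps to turn the reported rates into rough wall-clock time. A back-of-the-envelope sketch using the 122B-A10B Q4 figures from the post; the prompt and reply lengths are assumptions, and pp512 is measured at a 512-token prompt, so prefill throughput at longer contexts will typically come in lower than this estimate implies:

```python
# Illustrative arithmetic only: rates are the reported pp512/tg128 values
# for the 122B-A10B Q4 run, lengths are assumed for a typical assistant
# exchange. Real sessions add sampling and scheduling overhead on top.
prompt_tokens = 4000      # assumed context handed to the model
reply_tokens = 500        # assumed length of the generated answer

pp_rate = 268.0           # prompt processing, tokens/s (reported)
tg_rate = 21.3            # token generation, tokens/s (reported)

time_to_first_token = prompt_tokens / pp_rate   # prefill phase
generation_time = reply_tokens / tg_rate        # decode phase

print(f"prefill  ~{time_to_first_token:.0f} s")   # ~15 s
print(f"decode   ~{generation_time:.0f} s")       # ~23 s
print(f"total    ~{time_to_first_token + generation_time:.0f} s")
```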
The data also reinforces a familiar but now measurable tradeoff on this hardware tier: Q4 gives stronger speed for chat-like latency targets, while Q8 or BF16 trends toward better output fidelity at higher compute cost. That split is practical, not theoretical, and it directly affects product UX, inference budgeting, and model routing decisions.
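One way to see why that split exists: token generation on this class of hardware is largely memory-bandwidth-bound, so decode speed scales roughly with how many weight bytes must be streamed per token, while higher precision buys fidelity. A rough sketch under assumed bits-per-weight figures (real GGUF quants such as Q4_K_M carry scale and metadata overhead, and this ignores KV cache and activations):

```python
# Rough, illustrative arithmetic: in a bandwidth-bound decode phase,
# tokens/s scales roughly inversely with weight bytes read per token.
# Bits-per-weight values are approximations, and the 10B active-parameter
# figure corresponds to the 122B-A10B MoE model from the post.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5, "BF16": 16.0}
ACTIVE_PARAMS = 10e9  # active parameters touched per generated token

for quant, bits in BITS_PER_WEIGHT.items():
    gb_per_token = ACTIVE_PARAMS * bits / 8 / 1e9
    traffic_vs_q4 = bits / BITS_PER_WEIGHT["Q4"]
    print(f"{quant:>4}: ~{gb_per_token:.1f} GB read per token, "
          f"~{traffic_vs_q4:.1f}x the memory traffic of Q4")
```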
What to do next
Use these numbers as directional evidence, then verify on your own workloads before making architecture commitments. Reproduce with the same llama.cpp revision, run llama-bench on your actual prompt and generation profiles, and compare ROCm against Vulkan using the side-by-side approach the author referenced. If your use case is interactive tools, assistants, or live drafting, start with Q4 baselines and measure tail latency. If you’re building quality-critical writing, code review, or compliance-sensitive generation, test higher precision and accept slower throughput where accuracy gains justify it.
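For the ROCm-versus-Vulkan comparison, a hedged sketch of what “side by side” can look like in practice, assuming two separately compiled llama-bench binaries (one with the HIP/ROCm backend, one with Vulkan); the paths are placeholders, and the JSON schema should be checked against your build before relying on specific fields:

```python
import json
import subprocess

# Placeholder paths to two llama-bench binaries built against different
# backends (e.g. one configured for HIP/ROCm, one for Vulkan).
BACKENDS = {
    "rocm": "./build-rocm/bin/llama-bench",
    "vulkan": "./build-vulkan/bin/llama-bench",
}
MODEL = "/models/your-model-Q4_K_M.gguf"  # placeholder model path

def bench(binary: str) -> list:
    """Run one backend with an identical profile and return parsed JSON records."""
    # Swap -p and -n for values that match your real prompt and reply lengths.
    cmd = [binary, "-m", MODEL, "-p", "512", "-n", "128", "-ngl", "99", "-o", "json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

# Run both backends with the same flags so the comparison is apples to apples.
for name, path in BACKENDS.items():
    print(f"== {name} ==")
    for record in bench(path):
        # Inspect one record first: the throughput field (typically an average
        # tokens/s value) depends on the llama-bench version in use.
        print(record)
```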
Teams that benefit most right now are local-first app developers, privacy-focused orgs, and content creators pushing bigger contexts on limited compute. Just don’t treat one setup as universal truth yet; treat it as a strong signal to benchmark aggressively on your own hardware mix and model portfolio.
Source: Reddit benchmark post