Rotary GPU: Local MoE on 8 GB VRAM

Research Summary

A paper submitted to arXiv on 27 May 2026 describes Rotary GPU, an execution method for running large Mixture-of-Experts (MoE) models on consumer hardware with limited GPU memory. The author tested a Qwen3.6-35B-A3B-class model on an RTX 4060 Laptop GPU with 8 GB of VRAM. The system generated 2048 output tokens at 21.06 tokens per second while using roughly 6.3 GB of VRAM. The work focuses on deployment constraints rather than new model architectures.

Technical Approach

Rotary GPU builds on an earlier rotary-based accelerator residency idea. It keeps only active expert parameters in VRAM while swapping inactive ones through a residency mechanism tied to the rotary position embeddings already present in many transformer stacks. The method avoids full model loading and relies on selective expert activation patterns common in MoE designs.
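
The general idea of keeping only some experts resident can be illustrated with a small cache. This is a minimal sketch, not the paper's mechanism: it uses plain least-recently-used eviction under a byte budget, and the names `ExpertResidencyCache`, `load_fn`, and `fake_load` are hypothetical. The "weights" are byte arrays standing in for tensors that a real system would copy from host memory into VRAM.

```python
from collections import OrderedDict


class ExpertResidencyCache:
    """Keeps recently used experts resident under a fixed byte budget (LRU).

    Illustrative only: the paper's residency policy is not reproduced here.
    """

    def __init__(self, budget_bytes, load_fn):
        self.budget_bytes = budget_bytes
        self.load_fn = load_fn         # hypothetical loader: expert_id -> (weights, size_bytes)
        self.resident = OrderedDict()  # expert_id -> (weights, size_bytes), oldest first
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        # Hit: mark the expert as most recently used and return it.
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id][0]

        # Miss: load the expert, evicting the least recently used ones if needed.
        self.misses += 1
        weights, size = self.load_fn(expert_id)
        if size > self.budget_bytes:
            raise MemoryError("a single expert exceeds the residency budget")
        while self.used_bytes + size > self.budget_bytes:
            _, (_, evicted_size) = self.resident.popitem(last=False)
            self.used_bytes -= evicted_size
        self.resident[expert_id] = (weights, size)
        self.used_bytes += size
        return weights


def fake_load(expert_id):
    # Stand-in for a host-to-GPU copy; every fake expert is 100 bytes.
    size = 100
    return bytearray(size), size


cache = ExpertResidencyCache(budget_bytes=300, load_fn=fake_load)
for expert_id in [0, 1, 2, 0, 3, 0, 1]:
    cache.get(expert_id)

print("resident experts:", list(cache.resident))
print("hits:", cache.hits, "misses:", cache.misses)
```

Every miss in such a cache corresponds to a transfer into VRAM, so the hit rate under real routing patterns would largely determine decode speed.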

The implementation targets inference rather than training. No custom kernels or new attention variants are introduced; instead the approach modifies the execution scheduler to respect strict VRAM caps. The paper reports results on a single consumer laptop without external accelerators or model quantization beyond what the base checkpoint already used.

Measured Performance

Under the reported configuration the model sustained 21.06 tokens per second while generating 2048 tokens. Peak VRAM stayed near 6.3 GB, leaving roughly 1.7 GB of headroom on the 8 GB card. These figures come from a single primary run; the paper does not provide multi-run averages or comparisons against offloading baselines such as CPU layers or disk swapping.
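
For a sense of scale, the reported figures can be turned into wall-clock numbers. The short calculation below uses only the values quoted above and assumes the decode rate was constant over the whole run, which the paper's single measurement does not confirm. It works out to about 97 seconds for the 2048 tokens, or roughly 47.5 ms per token.

```python
# Figures reported for the single primary run.
tokens = 2048
tokens_per_second = 21.06
vram_total_gb = 8.0   # nominal capacity of the card
vram_peak_gb = 6.3    # approximate peak usage

# Assumes a constant decode rate across the run.
decode_seconds = tokens / tokens_per_second
ms_per_token = 1000.0 / tokens_per_second
headroom_gb = vram_total_gb - vram_peak_gb

print(f"decode time:  {decode_seconds:.1f} s")
print(f"per token:    {ms_per_token:.1f} ms")
print(f"VRAM headroom: {headroom_gb:.1f} GB")
```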

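The model in question contains 35 billion total parameters but activates only a subset per token because of its MoE routing; the "A3B" suffix conventionally denotes about 3 billion active parameters. This sparsity is essential to the observed memory footprint. A dense model of the same total size needs all of its weights for every token, so the same technique would not fit it on the same hardware.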
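
A back-of-the-envelope weight calculation shows why sparsity matters here. The sketch below is an estimate under stated assumptions, not a figure from the paper: the bit widths are hypothetical (the paper's checkpoint precision is not given in this summary), the 3 billion active parameters are inferred from the "A3B" name, and KV cache, activations, and runtime overhead are ignored. GB means 10^9 bytes.

```python
def weight_gb(num_params, bits_per_param):
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9   # total parameters, from the model name
ACTIVE_PARAMS = 3e9   # assumption: "A3B" read as ~3B active parameters per token
VRAM_GB = 8.0

# Hypothetical precisions; the actual checkpoint format is not stated here.
for bits in (16, 8, 4):
    total = weight_gb(TOTAL_PARAMS, bits)
    active = weight_gb(ACTIVE_PARAMS, bits)
    print(
        f"{bits:>2}-bit: all weights {total:5.1f} GB, "
        f"active per token {active:4.1f} GB, "
        f"all weights fit in {VRAM_GB:.0f} GB: {total <= VRAM_GB}"
    )
```

Under these assumptions the full weight set exceeds 8 GB even at 4 bits per parameter, while the per-token active set stays well below it, which is the gap a residency scheme has to bridge.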

Practical Trade-offs

For developers working under hardware or network limits, the numbers indicate that mid-sized MoE checkpoints can run locally without data-center resources. The decode speed remains usable for interactive tasks, though the paper does not detail prompt-processing (prefill) latency. The approach requires the model to expose clear expert boundaries; dense models would need additional partitioning logic.

Memory stability depends on the routing behavior staying predictable. Sudden spikes in active experts could exceed the measured 6.3 GB. The paper frames the results as exploratory, so production use would require further validation on target hardware and workloads. No code release is mentioned, limiting immediate experimentation.
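
One way to reason about such spikes is to measure, from a recorded routing trace, how many distinct experts are needed within a short window of consecutive tokens. The function below is a hypothetical diagnostic, not something described in the paper; the trace format (one tuple of expert IDs per token) and the example values are made up for illustration.

```python
from collections import Counter, deque


def peak_working_set(trace, window):
    """Largest number of distinct experts used in any `window` consecutive tokens.

    `trace` is a sequence with one tuple of expert IDs per generated token.
    """
    counts = Counter()  # expert_id -> uses inside the current window
    recent = deque()    # per-token expert tuples currently in the window
    peak = 0
    for experts in trace:
        recent.append(experts)
        counts.update(experts)
        # Drop the oldest token once the window is over-full.
        if len(recent) > window:
            for expert_id in recent.popleft():
                counts[expert_id] -= 1
                if counts[expert_id] == 0:
                    del counts[expert_id]
        peak = max(peak, len(counts))
    return peak


# Made-up trace: routing is stable at first, then jumps to new experts.
trace = [(0, 1), (0, 1), (1, 2), (5, 6), (7, 8), (0, 1)]
peak = peak_working_set(trace, window=2)
print("peak distinct experts in any 2-token window:", peak)
```

Multiplying this peak by a per-expert size gives a rough upper bound on the resident expert memory a workload would need, which could then be compared against the available headroom on the target card.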
