Run SOTA LLMs Locally Guide | Stefano Salvucci

Summary of the Guide

Jamesob published the repository local-llm on GitHub in late 2024. The README describes a complete hardware and software stack for running large language models locally. It includes parts lists for a previous-generation EPYC system paired with four RTX PRO 6000 cards, PCIe switch configuration for direct GPU-to-GPU communication, and ready-to-run Docker Compose files for both inference and speech-to-text workloads.

Hardware Choices and Costs

The documented build splits into two price tiers. At roughly $2k, the author recommends a consumer platform that can still host Qwen-class models plus whisper-large-v3 for transcription. At the high end, the full $40k configuration is built on a used EPYC 7002-series motherboard and DDR4 memory bought on the secondary market, which keeps the base-system cost near $5.6k.

The dominant expense is the four RTX PRO 6000 cards, which together provide 384 GB of VRAM. Jamesob added a PCIe Gen4 switch board from c-payne.com so the GPUs can perform all-reduce operations without routing every packet through the CPU root complex. This choice avoids the need for a full PCIe 5.0 platform while still reaching measured peer-to-peer bandwidth of 27.5 GB/s in one direction and 50.4 GB/s aggregate.
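As a rough way to reason about what fits in that pool, the sketch below estimates weights-only memory for a given parameter count and precision. The 10% headroom for KV cache and activations and the 4-bit and 8-bit precisions are illustrative assumptions, not figures from the repo.

```python
# Back-of-the-envelope VRAM check. Weights only: KV cache, activations and
# framework overhead are lumped into a hypothetical headroom fraction.

TOTAL_VRAM_GB = 384          # four cards, as stated in the guide
PER_GPU_VRAM_GB = TOTAL_VRAM_GB / 4


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = TOTAL_VRAM_GB, headroom: float = 0.10) -> bool:
    """True if the weights fit after reserving `headroom` of the pool."""
    return weights_gb(params_billion, bits_per_weight) <= vram_gb * (1 - headroom)
```

Under these assumptions a 594B-parameter model needs about 297 GB at 4 bits per weight and fits, while the same model at 8 bits per weight (about 594 GB) does not.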

Power delivery stays within a standard 110 V circuit by applying aggressive power caps to each GPU. The author reports that the resulting thermal and electrical load remains manageable for sustained inference runs.
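The guide does not give the cap values in this summary, so the following is only a sketch of how a per-GPU cap could be derived from a circuit budget. The 15 A breaker, the 80% continuous-load derating, and the 200 W base-system draw are all assumptions chosen for illustration; `nvidia-smi -pl` is the standard flag for setting a power limit in watts.

```python
# Derive a per-GPU power cap from a circuit budget and build the matching
# nvidia-smi commands. All numeric inputs in the usage note are assumptions.

def per_gpu_cap_watts(volts: float, amps: float, n_gpus: int,
                      base_system_watts: float,
                      continuous_derate: float = 0.8) -> int:
    """Split the derated circuit budget, minus the base system, across GPUs."""
    budget = volts * amps * continuous_derate
    gpu_budget = budget - base_system_watts
    if gpu_budget <= 0:
        raise ValueError("base system alone exceeds the circuit budget")
    return int(gpu_budget // n_gpus)


def nvidia_smi_commands(cap_watts: int, n_gpus: int) -> list[list[str]]:
    """One `nvidia-smi -i <index> -pl <watts>` command per GPU (needs root)."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(cap_watts)]
            for i in range(n_gpus)]
```

With 110 V, 15 A, four GPUs and a 200 W base system, `per_gpu_cap_watts(110, 15, 4, 200)` gives 280 W per card. The driver rejects values outside each card's supported range, so the real cap has to be checked against what `nvidia-smi -q -d POWER` reports.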

Kernel and BIOS Settings

Several low-level tweaks are required for stable multi-GPU operation. BIOS options must enable PCIe bifurcation on the switch-connected slots and set link speeds to Gen4. ASPM is left enabled on the switch fabric itself.

On the Linux side, the author disables ACS via kernel command-line flags so that peer-to-peer traffic between the GPUs is not redirected through the root complex, which NCCL relies on. IOMMU is turned off entirely; otherwise NCCL hangs appear during tensor-parallel inference. GRUB parameters also include explicit PCIe ASPM policy settings that prevent the switch from entering lower-power states mid-generation.
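A quick way to confirm that the boot parameters actually took effect is to parse the kernel command line. The sketch below does that; the specific flags in `EXPECTED` (`amd_iommu=off` and `pcie_aspm.policy=performance`) are real kernel parameters but are an assumption about what the repo uses, since this summary does not list the exact values.

```python
# Check a kernel command line for expected parameters.
# EXPECTED is a hypothetical set of flags, not copied from the repo.

EXPECTED = {
    "amd_iommu": "off",                  # assumed way of turning the IOMMU off
    "pcie_aspm.policy": "performance",   # assumed ASPM policy setting
}


def parse_cmdline(text: str) -> dict:
    """Map each `key=value` token to its value; bare flags map to None."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def mismatches(text: str, expected: dict = EXPECTED) -> dict:
    """Return {key: actual_value} for every expected flag that is missing or different."""
    params = parse_cmdline(text)
    return {k: params.get(k) for k, v in expected.items() if params.get(k) != v}
```

On a running system the input would come from `/proc/cmdline`, for example `mismatches(open("/proc/cmdline").read())`; an empty result means every expected flag is present.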

A short benchmark script included in the repo measures actual P2P latency and bandwidth after these changes. Sub-microsecond latency between cards is the target; any deviation usually points to a misconfigured root port or an ACS bit that was left enabled.
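The repo's own benchmark script is not reproduced here. As a stand-in, this is a minimal sketch of a one-directional bandwidth measurement using PyTorch, assuming PyTorch with CUDA support is installed and at least two GPUs are present; the function names, buffer size, and iteration count are hypothetical.

```python
# Minimal GPU-to-GPU copy bandwidth sketch (not the repo's script).
import time


def gb_per_s(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and elapsed time to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / seconds / 1e9


def measure_p2p(src: int = 0, dst: int = 1,
                size_mb: int = 256, iters: int = 20) -> float:
    """Time repeated device-to-device copies and return one-way GB/s."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.can_device_access_peer(src, dst):
        raise RuntimeError(f"GPU {src} cannot access GPU {dst} as a peer")

    a = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    b = torch.empty_like(a, device=f"cuda:{dst}")

    b.copy_(a)  # warm-up copy, excluded from timing
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)

    start = time.perf_counter()
    for _ in range(iters):
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - start

    return gb_per_s(a.numel() * iters, elapsed)
```

A result well below the 27.5 GB/s one-direction figure quoted earlier would be a reason to recheck the ACS and root-port settings. This sketch measures bandwidth only, not latency.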

Model Serving and STT Configurations

The runners/ directory contains Docker Compose files that start vLLM with tensor parallelism across the four GPUs. One example launches GLM-5.2-594B at approximately 80 tokens per second with a 460k context window using DCP4 and MTP5, labels that in vLLM usually denote decode context parallelism of size four and multi-token-prediction speculative decoding with five draft tokens rather than quantization schemes. Environment variables point to the correct NCCL and CUDA paths so that peer-to-peer traffic stays inside the switch.
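The compose files themselves are not shown in this summary. To make the tensor-parallel launch concrete, here is a sketch that assembles a `vllm serve` command line; `--tensor-parallel-size`, `--max-model-len`, and `--port` are existing vLLM flags, while the model identifier and numeric values in the usage note are placeholders rather than the repo's settings.

```python
# Build a `vllm serve` argument list for a tensor-parallel launch.
# The helper name and defaults are illustrative, not taken from the repo.

def vllm_serve_argv(model: str, tensor_parallel_size: int = 4,
                    max_model_len=None, port: int = 8000) -> list:
    """Return the argv for `vllm serve` with tensor parallelism enabled."""
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]
    if max_model_len is not None:
        # Caps the context window vLLM will allocate KV cache for.
        argv += ["--max-model-len", str(max_model_len)]
    return argv
```

For example, `vllm_serve_argv("org/some-model", max_model_len=131072)` produces a list that can be passed to `subprocess.run` or pasted into the `command:` field of a compose service.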

A separate compose file for speech-to-text runs whisper-large-v3 in a container with GPU offload enabled. The configuration exposes a simple HTTP endpoint that accepts 16 kHz audio and returns segments with timestamps. Both setups mount the same model weights directory so disk usage does not double.
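Because the endpoint expects 16 kHz audio, it can help to validate a file before uploading it. The sketch below uses only the standard-library `wave` module; the helper names are hypothetical, and it handles uncompressed WAV files only.

```python
# Validate that a WAV file is 16 kHz before sending it for transcription.
import wave


def check_wav(path: str, expected_rate: int = 16000) -> dict:
    """Report sample-rate match, channel count and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_ok": rate == expected_rate,
        "channels": channels,
        "duration_s": frames / rate,
    }


def write_silence(path: str, seconds: float = 1.0, rate: int = 16000) -> None:
    """Write mono 16-bit silence, handy as a smoke-test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

A file that fails the check would need resampling first, for instance with ffmpeg, before it matches what the endpoint accepts.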

The author notes that the switch fabric reduces the need for newer server platforms, but it also introduces one extra failure mode: if the switch firmware drops a link, NCCL falls back to CPU-mediated transfers and throughput collapses. Monitoring the switch logs therefore becomes part of routine operations.
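Switch-specific logs depend on the vendor's tooling, which this summary does not describe. A generic complement is to compare each device's negotiated PCIe link against its maximum through the Linux sysfs attributes `current_link_speed`, `max_link_speed`, `current_link_width`, and `max_link_width`. The sketch below assumes those files are readable and that the device addresses are supplied by the operator.

```python
# Detect a downgraded PCIe link from Linux sysfs attributes.
from pathlib import Path

LINK_ATTRS = ("current_link_speed", "max_link_speed",
              "current_link_width", "max_link_width")


def parse_gts(text: str) -> float:
    """Parse strings like '16.0 GT/s PCIe'; unparsable values count as 0."""
    try:
        return float(text.split()[0])
    except (ValueError, IndexError):
        return 0.0


def link_degraded(cur_speed: str, max_speed: str,
                  cur_width: str, max_width: str) -> bool:
    """True if the link runs slower or narrower than the device supports."""
    return (parse_gts(cur_speed) < parse_gts(max_speed)
            or int(cur_width) < int(max_width))


def read_link(bdf: str, root: str = "/sys/bus/pci/devices") -> tuple:
    """Read the four link attributes for a device such as '0000:41:00.0'."""
    dev = Path(root) / bdf
    return tuple((dev / name).read_text().strip() for name in LINK_ATTRS)
```

On a live system, `link_degraded(*read_link(bdf))` can run from a cron job or monitoring agent. GPUs commonly drop their link speed while idle to save power, so a speed mismatch is most meaningful under load, whereas a reduced width is suspicious at any time.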

FAQs

What models run well on this hardware? The repo lists GLM-5.2-594B and several Qwen variants as the current practical choices for the 384 GB VRAM pool.

Do I need the PCIe switch? Without it the GPUs still function, but all-reduce latency rises and token throughput drops by roughly 30% on large tensor-parallel jobs.

Can the same stack run on consumer GPUs? The Docker files work on smaller cards; only the power-limiting and switch sections become irrelevant.
