Running Gemma 4 Locally in Codex CLI: Real-World Tests

Daniel Vaughan ran Gemma 4 on local machines, showing that it cuts costs, improves privacy, and reduces cloud dependency for daily AI coding.

Summary of the News

According to Hacker News, Daniel Vaughan tested running Gemma 4 as a local model in Codex CLI on two setups: a 24 GB M4 Pro MacBook Pro and a Dell Pro Max GB10. He compared its performance for code generation tasks against a cloud-based GPT-5.4 model, focusing on cost, privacy, and reliability. The tests showed Gemma 4 handling tool calls effectively, with benchmarks indicating it's a practical alternative despite initial setup hurdles.

The Setup and Challenges

Vaughan's experiment involved configuring Gemma 4 on different hardware to mimic real-world use. On the MacBook Pro, he ran the 26B MoE variant with ggerganov's llama.cpp, while the Dell machine handled the 31B Dense variant with Ollama v0.20.5. Both were integrated into Codex CLI via a config.toml file, setting wire_api to "responses" for proper output handling.
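For reference, here is a minimal sketch of what that config.toml wiring might look like, based on Codex CLI's provider/profile configuration format. The provider block, the localhost port, and the gemma4:31b model tag are illustrative assumptions, not details taken from Vaughan's write-up:

```toml
# ~/.codex/config.toml — hypothetical sketch, not Vaughan's exact config.
# Assumes Ollama serving an OpenAI-compatible API on its default port.
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"      # the setting the article calls out

[profiles.gemma-local]
model = "gemma4:31b"        # illustrative tag for the 31B Dense variant
model_provider = "ollama"
```

With a profile like this in place, a session would presumably be started with something along the lines of `codex --profile gemma-local`.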

He encountered bugs right away. For instance, Ollama v0.20.3 had a streaming issue that misrouted tool-call responses, forcing an upgrade to v0.20.5. The debugging took time but highlighted common pitfalls in local AI setups. From my perspective as a developer working on AI automation, these details matter: they show how even promising models like Gemma 4 need precise environment tweaks to avoid basic failures.
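A cheap way to avoid that class of failure is to gate on the fixed version before wiring Ollama into Codex CLI. The sketch below (my own guard, not from the article) compares the installed version against v0.20.5, the release the article ended up on after the v0.20.3 streaming bug:

```shell
#!/bin/sh
# Refuse to proceed if the installed Ollama predates the tool-call
# streaming fix (the bug was in v0.20.3; v0.20.5 is known good here).

min_version_ok() {
  # Returns success if $2 >= $1 under version sort.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

required="0.20.5"
installed="$(ollama --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)"
[ -n "$installed" ] || installed="0.0.0"   # ollama missing or unparsable

if min_version_ok "$required" "$installed"; then
  echo "Ollama $installed includes the streaming fix"
else
  echo "Ollama $installed predates the streaming fix; upgrade before use"
fi
```

The `sort -V` comparison keeps the check dependency-free; it works with any POSIX shell plus GNU coreutils.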

The process wasn't straightforward, but it underscored the importance of function-calling accuracy. Gemma 4 scored 86.4% on the tau2-bench, a sharp improvement over earlier versions, making it suitable for tasks like reading files or executing code. I find this reliability key for projects where I need agents that interact with local tools without cloud dependencies.

Results and Performance Insights

In Vaughan's tests, Gemma 4 performed well on both machines for code generation, though speeds varied. The MacBook Pro completed tasks more slowly because of its 24 GB memory ceiling, while the Dell's 128 GB and NVIDIA Blackwell chip delivered faster inference. He used the same prompts for all runs, giving a direct benchmark against GPT-5.4.

Key metrics included token speed and error rates. For example, the dense variant on the Dell processed prompts with fewer failures in tool calls, thanks to its architecture. This contrasts with cloud models, which offer consistent speed but at a per-token cost. In my experience with AI automation in web apps, these trade-offs are critical—local models save on expenses and keep data private, but they demand hardware that can handle large models without throttling.
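To make that trade-off concrete, here is a back-of-envelope comparison of amortized hardware cost against per-token cloud billing. Every number below is a hypothetical placeholder I chose for illustration; none comes from the article:

```shell
#!/bin/sh
# Back-of-envelope sketch: local hardware amortization vs cloud billing.
# All figures are hypothetical placeholders, not from Vaughan's tests.

tokens_per_month=50000000     # hypothetical monthly token volume
cloud_price_per_mtok=5        # hypothetical $ per 1M tokens
hardware_cost=4000            # hypothetical machine price, $
amortization_months=36        # write the machine off over 3 years

cloud_monthly=$(( tokens_per_month / 1000000 * cloud_price_per_mtok ))
local_monthly=$(( hardware_cost / amortization_months ))

echo "cloud: \$${cloud_monthly}/month"
echo "local: \$${local_monthly}/month (amortized, excluding power)"
```

The crossover point shifts with your actual token volume; the structure of the calculation, not the placeholder numbers, is the takeaway.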

Overall, the results suggest Gemma 4 is viable for everyday use in Codex CLI. It handled agentic coding tasks effectively, with the MoE variant excelling in memory-constrained environments. I see this as a solid step forward, especially since it reduces reliance on external APIs, though users must weigh hardware costs against cloud bills.

Implications for Developers

Running Gemma 4 locally shifts how we approach AI in development. For those like me, who build automation tools with Node.js and Python, it offers clear benefits: lower costs by avoiding API fees, enhanced privacy for sensitive codebases, and better uptime without server outages.

On the downside, not every setup will work seamlessly. You need at least 24 GB of RAM for decent performance, and debugging local models can eat into productive time. Compared to cloud options, Gemma 4 might lag in raw speed, but its function-calling prowess makes it a strong contender for iterative coding tasks.

In practical terms, this means developers can integrate local models into their workflows using tools like codex-cli. I recommend trying it if you're handling proprietary data, as it keeps everything on-device. Ultimately, it's a balanced choice that prioritizes control over convenience.

Frequently Asked Questions

What is Gemma 4? Gemma 4 is a large language model from Google, designed for efficient local inference with improved tool-calling capabilities, making it suitable for tasks like code generation in environments like Codex CLI.

How does it compare to cloud-based models like GPT-5.4? It offers similar functionality at a lower cost and with better privacy, but may run slower on standard hardware; benchmarks show it's reliable for agentic tasks once set up.

Is it worth implementing for AI automation projects? Yes, if you prioritize data security and cost savings, as long as your machine meets the memory requirements; it integrates well with existing tools for developers focused on web and AI workflows.
