Setting Up a Local Coding Agent on macOS

Step-by-step setup of Gemma 4 with llama.cpp on Apple Silicon for a fast, private coding agent with OpenAI-compatible API and image support.

Setting Up a Local Coding Agent on macOS

Local Setup Details from Recent Report

A post on Hacker News described a configuration for running a local coding agent on macOS. The author used llama.cpp built with Metal support, the Gemma 4 26B-A4B model in GGUF format, a Q8 MTP draft model, and a multimodal projector. The setup ran on an M1 Max with 64 GB memory under macOS 15.7.7. It exposed an OpenAI-compatible API and supported image inputs for tasks such as reviewing screenshots of generated code output.

Model Files and llama.cpp Configuration

The primary model file is gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf, roughly 16 GB in size. Adding the MTP draft file and projector brings the total folder size to about 17 GB. Commands load the main model with Metal offload and flash attention enabled:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  -ngl 999 -fa on -c 4096 -n 128

This produces 58.2 tokens per second on generation after a 298 tokens-per-second prompt phase. The same binary accepts the draft model through additional flags for speculative decoding:

--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--spec-type draft-mtp --spec-draft-n-max 3

Testing showed the optimum at three draft tokens, raising generation speed to 72.2 tokens per second.

Measured Speed and API Exposure

Baseline runs without the draft model stayed at 58 tokens per second. Enabling the MTP head delivered a measured 24 percent increase while keeping prompt evaluation nearly unchanged. The resulting server instance can be pointed at by any client expecting an OpenAI endpoint, which satisfies the requirement for compatibility with existing agent tooling. Image support comes from the multimodal projector, allowing the agent to receive screenshots without extra preprocessing steps. Memory usage on the tested M1 Max stayed within the 64 GB unified pool when context length was capped at 4096 tokens.

Practical Constraints for Daily Use

The configuration requires a recent macOS version and sufficient unified memory; machines with 32 GB or less will hit swapping once context grows. Model download size exceeds 17 GB, so initial setup time depends on connection speed. Generation remains below 80 tokens per second even with speculative decoding, which limits interactive loops that expect sub-second responses for every tool call. No additional fine-tuning steps are described, so output quality rests on the base Gemma 4 checkpoint. The terminal agent component, referred to as Pi, consumes the OpenAI-compatible endpoint directly and handles screenshot uploads without custom glue code.

FAQs

What hardware is required? An Apple Silicon Mac with at least 64 GB unified memory is specified for the reported speeds.

Does the setup need an internet connection after download? No. All inference runs locally once the GGUF files are present.

Can existing OpenAI client libraries connect to it? Yes. The llama.cpp server exposes the standard chat completions endpoint.

---

๐Ÿ“– Related articles

Need a consultation?

I help companies and startups build software, automate workflows, and integrate AI. Let's talk.

Get in touch
โ† Back to blog