vLLM on HF Jobs One Command | Stefano Salvucci

News Summary

On June 26, 2026, Hugging Face announced support for running vLLM servers directly through HF Jobs. The feature lets users start a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command. It requires huggingface_hub version 1.20.0 or higher and a payment method on the account. The approach targets quick tests, evaluations, and batch generation rather than managed production workloads.

Launching the Server

The command uses hf jobs run with the official vllm/vllm-openai image and requests GPU hardware through the --flavor flag. The --expose 8000 flag exposes the server port, and --timeout caps the run so the job cannot keep billing indefinitely. A typical invocation looks like this:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
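
When the same launch has to be repeated for several models or GPU flavors, it can help to assemble the command in a script. The following is a minimal sketch that only rebuilds the invocation shown above as an argument list. The helper name build_vllm_job_command is hypothetical, and the defaults simply mirror the example.

```python
import shlex


def build_vllm_job_command(
    model,
    flavor="a10g-large",
    port=8000,
    timeout="2h",
    image="vllm/vllm-openai:latest",
):
    """Return the argv list for the `hf jobs run` invocation shown above.

    Hypothetical helper: it only mirrors the flags used in this article
    and does not validate them against the CLI.
    """
    return [
        "hf", "jobs", "run",
        "--flavor", flavor,
        "--expose", str(port),   # port exposed by the job
        "--timeout", timeout,    # upper bound on the run
        image,
        "vllm", "serve", model,
        "--host", "0.0.0.0",
        "--port", str(port),     # must match the exposed port
    ]


def as_shell_line(argv):
    """Render the argv list as a single copy-pasteable shell line."""
    return shlex.join(argv)
```

Passing the list to subprocess.run(argv, check=True) would start the job, provided the hf CLI is installed and logged in. Using one port value for both --expose and --port keeps the two from drifting apart.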

The job returns an ID and a public URL under the hf.jobs domain. The model weights are downloaded on first start, and the server is ready once "Application startup complete" appears in the logs. Keep the job ID, since it is needed for later reference and for sending queries.
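
The readiness check described above can be automated by scanning the job logs for that marker. This is a minimal sketch that works on any iterable of log lines. How the lines are obtained (for example by piping the output of the CLI's log command into the script) is left open and is an assumption.

```python
from typing import Iterable

# Marker quoted in the article; the rest of each log line is not assumed.
READY_MARKER = "Application startup complete"


def server_is_ready(log_lines: Iterable[str]) -> bool:
    """Return True as soon as any log line contains the readiness marker."""
    return any(READY_MARKER in line for line in log_lines)
```

For a streamed log, passing sys.stdin to server_is_ready returns once the marker shows up, because any() stops consuming lines at the first match.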

Accessing the API

vLLM exposes the standard OpenAI chat completions endpoint. Requests require an HF token passed as a bearer token. A curl example reads:

curl https://<job-id>--8000.hf.jobs/v1/chat/completions \
-H "Authorization: Bearer $(hf auth token)" \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "chat_template_kwargs": {"enable_thinking": false} }'

In Python, the same endpoint works with the official OpenAI client: set the base URL to the job's /v1 path and pass the token returned by huggingface_hub.get_token() as the API key. Extra body parameters such as chat_template_kwargs pass through unmodified. Responses follow the OpenAI schema, so existing client code requires minimal changes.
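
A minimal sketch of that Python setup follows, assuming the openai and huggingface_hub packages are installed and a token is stored locally. The helper names (api_base_url, build_extra_body, chat) are hypothetical, and job_url stands for the URL printed when the job started.

```python
def api_base_url(job_url: str) -> str:
    """Append the OpenAI-style /v1 prefix to the URL printed by the job."""
    return job_url.rstrip("/") + "/v1"


def build_extra_body(enable_thinking: bool = False) -> dict:
    """Extra request fields that the OpenAI client forwards as-is."""
    return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}


def chat(job_url: str, prompt: str, model: str = "Qwen/Qwen3-4B") -> str:
    """Send one user message to the vLLM server and return the reply text."""
    # Imported here so the helpers above work without these packages.
    from huggingface_hub import get_token
    from openai import OpenAI

    token = get_token()
    if token is None:
        raise RuntimeError("No Hugging Face token found; log in first.")

    # The HF token is sent as the bearer token, as in the curl example.
    client = OpenAI(base_url=api_base_url(job_url), api_key=token)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        extra_body=build_extra_body(enable_thinking=False),
    )
    return response.choices[0].message.content
```

Calling chat(job_url, "Hello!") mirrors the curl request. The model name must match the one passed to vllm serve when the job was launched.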

Considerations for Production

This method bills per minute of hardware usage, and the job stops when the timeout expires or the job is cancelled. It provides no built-in scaling, load balancing, or automatic restarts. For sustained traffic, the managed Inference Endpoints service remains the documented alternative. The Jobs route suits short-lived experiments where the model fits on a single GPU and the user accepts manual job management. Beyond the exposed port, no persistent storage or custom networking is available.
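
Because billing is per minute and the timeout caps the run, the timeout also gives an upper bound on cost. The sketch below works through that arithmetic under stated assumptions: the price per minute is a hypothetical input (the article gives no rates), timeouts use s, m, h, or d suffixes with bare numbers read as seconds, and partial minutes are rounded up.

```python
import math

# Assumed suffixes for timeout strings such as "2h" or "90m".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_to_minutes(timeout: str) -> int:
    """Convert a timeout like '2h', '90m', or '3600' to whole minutes.

    Bare numbers are treated as seconds; partial minutes round up.
    """
    text = timeout.strip().lower()
    if text and text[-1] in _UNIT_SECONDS:
        seconds = float(text[:-1]) * _UNIT_SECONDS[text[-1]]
    else:
        seconds = float(text)
    return math.ceil(seconds / 60)


def max_cost(timeout: str, price_per_minute: float) -> float:
    """Worst-case cost if the job runs until the timeout expires."""
    return timeout_to_minutes(timeout) * price_per_minute
```

With a made-up rate of 0.025 per minute, max_cost("2h", 0.025) covers 120 billed minutes. Cancelling the job earlier lowers the actual charge.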
