News Summary
Hugging Face announced support for running vLLM servers directly through HF Jobs on June 26, 2026. The feature lets users start a private, OpenAI-compatible LLM endpoint on Hugging Face's infrastructure with a single command. It requires huggingface_hub version 1.20.0 or higher and a payment method. The approach targets quick tests, evaluations, and batch generation rather than managed production workloads.
Launching the Server
The command uses hf jobs run with the official vllm/vllm-openai image and requests GPU hardware through the --flavor flag. The --expose 8000 flag publishes port 8000, and --timeout caps the run so it cannot continue indefinitely. A typical invocation looks like this:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
The job returns an ID and a public URL under the hf.jobs domain. Model weights are downloaded on first start, and the server is ready once "Application startup complete" appears in the logs. Keep the job ID for later reference and queries.
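The readiness check described above can be automated. The following is a minimal sketch, not an official recipe: it scans log lines for the "Application startup complete" marker. The helper names server_ready and wait_for_job are hypothetical, and the sketch assumes that hf jobs logs followed by the job ID streams the job's log lines to standard output.

```python
import subprocess
from typing import Iterable

# Marker quoted in the article as the sign that vLLM is serving requests.
READY_MARKER = "Application startup complete"


def server_ready(log_lines: Iterable[str], marker: str = READY_MARKER) -> bool:
    """Return True as soon as any log line contains the readiness marker."""
    return any(marker in line for line in log_lines)


def wait_for_job(job_id: str) -> bool:
    """Hypothetical helper: read the job's logs until the marker shows up.

    Assumes `hf jobs logs <job_id>` prints the job's log lines to stdout.
    """
    with subprocess.Popen(
        ["hf", "jobs", "logs", job_id], stdout=subprocess.PIPE, text=True
    ) as proc:
        ready = server_ready(proc.stdout)
        proc.terminate()  # stop reading logs once the answer is known
        return ready
```

Keeping the marker check separate from the log source makes it easy to test against saved log lines before pointing it at a live job.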
Accessing the API
vLLM exposes the standard OpenAI chat completions endpoint. Requests require an HF token passed as a bearer token. A curl example reads:
curl https://--8000.hf.jobs/v1/chat/completions \
-H "Authorization: Bearer $(hf auth token)" \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "chat_template_kwargs": {"enable_thinking": false} }'
In Python the same endpoint works with the official OpenAI client by setting the base URL and passing the token from huggingface_hub.get_token(). Extra body parameters such as chat_template_kwargs pass through without modification. The response format matches the OpenAI schema exactly, so existing client code requires minimal changes.
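A minimal sketch of that Python client setup might look like the following. It assumes the openai and huggingface_hub packages are installed. The function names chat_request_kwargs and ask are hypothetical, and base_url must be the job's own public URL with /v1 appended, which is not reproduced here.

```python
def chat_request_kwargs(model: str, prompt: str, enable_thinking: bool = False) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    vLLM-specific options such as chat_template_kwargs are not part of the
    OpenAI schema, so the OpenAI client forwards them through extra_body.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


def ask(base_url: str, prompt: str, model: str = "Qwen/Qwen3-4B") -> str:
    """Send one chat message to the vLLM server and return the reply text.

    base_url is an assumption: the public URL of your job, ending in /v1.
    """
    # Imported lazily so the payload helper above works without these packages.
    from huggingface_hub import get_token
    from openai import OpenAI

    # The HF token takes the place of an OpenAI API key (sent as a bearer token).
    client = OpenAI(base_url=base_url, api_key=get_token())
    response = client.chat.completions.create(**chat_request_kwargs(model, prompt))
    return response.choices[0].message.content
```

Calling ask with the job's URL and a prompt such as "Hello!" mirrors the curl request shown earlier, with enable_thinking set to false by default.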
Considerations for Production
This method bills per minute of hardware usage and stops when the timeout expires or the job is cancelled. It provides no built-in scaling, load balancing, or automatic restarts. For sustained traffic the managed Inference Endpoints service remains the documented alternative. The Jobs route suits short-lived experiments where the model fits on a single GPU and the user accepts manual job management. No persistent storage or custom networking is available beyond the exposed port.
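Because billing is per minute and the timeout caps the run, the worst-case cost of a job can be bounded before launching it. The sketch below illustrates the arithmetic only: the hourly rate is a made-up placeholder rather than a real price, the function name billed_cost is hypothetical, and rounding partial minutes up is an assumption about how usage is metered.

```python
import math


def billed_cost(elapsed_seconds: float, hourly_rate_usd: float) -> float:
    """Estimate the cost of a job billed per minute of hardware usage.

    Assumption: a partially used minute is billed as a full minute.
    """
    billed_minutes = math.ceil(elapsed_seconds / 60)
    return billed_minutes * hourly_rate_usd / 60


# Placeholder rate for illustration only; look up the real price of your flavor.
PLACEHOLDER_RATE_USD_PER_HOUR = 1.50

# Upper bound for a job launched with --timeout 2h.
worst_case = billed_cost(2 * 60 * 60, PLACEHOLDER_RATE_USD_PER_HOUR)
print(f"Worst-case cost at the placeholder rate: ${worst_case:.2f}")
```

Cancelling the job as soon as the experiment finishes keeps the actual cost below this bound.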