Orthrus: Fast Lossless LLM Inference

Orthrus Framework Overview

Orthrus, from GitHub user chiennv2000, introduces a dual-architecture system that pairs standard autoregressive decoding with diffusion-based parallel generation. The project has released model checkpoints built on Qwen3 backbones in three sizes: 1.7B, 4B, and 8B. Each checkpoint is stated to produce output identical, token for token, to the original autoregressive model, with measured average speedups between 4.25× and 5.36× on the project's evaluation set. The repository appeared on GitHub Trending with 284 stars and a short demonstration video showing streaming output.

Dual-View Diffusion Mechanism

The core change replaces sequential next-token prediction with a diffusion process that generates multiple tokens in one forward pass. Two separate views of the same hidden state run in parallel: one maintains the exact probability distribution of the base LLM, the other proposes candidate tokens across several positions simultaneously. An intra-model consensus step discards any token that fails to match the autoregressive distribution, preserving strict losslessness. Because the diffusion branch operates on the same weights, no additional training is required beyond the provided checkpoints.
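The consensus idea can be illustrated with a small sketch. This is not the Orthrus implementation; it is a generic accept-the-matching-prefix rule, assuming greedy decoding, with a hypothetical function name and plain lists of token IDs standing in for model outputs.

```python
from typing import List


def accept_matching_prefix(proposed: List[int], reference: List[int]) -> List[int]:
    """Keep proposed tokens up to the first disagreement with the reference.

    `proposed` holds candidate token IDs from the parallel branch.
    `reference` holds the tokens the autoregressive view would pick at the
    same positions (greedy decoding assumed). At the first mismatch the
    reference token is emitted instead and the rest is discarded, so the
    result always equals what plain autoregressive decoding would produce.
    """
    accepted: List[int] = []
    for cand, ref in zip(proposed, reference):
        if cand != ref:
            accepted.append(ref)  # fall back to the autoregressive choice
            break
        accepted.append(cand)
    return accepted
```

When every candidate matches, all positions are accepted in one step; when the first candidate is wrong, the step still yields one correct token, which is the worst case of ordinary decoding.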

The approach removes the sequential bottleneck that limits standard decoding on long outputs. In practice the clearest gains appear on tasks with high token counts, such as code generation or structured JSON responses. Short generations see smaller improvements because the fixed overhead of the diffusion scheduler accounts for a larger share of total time.
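A toy cost model makes the effect of fixed overhead concrete. The formula and every number below are illustrative assumptions, not measurements from the project: decoding time is modeled as a fixed overhead plus the baseline time divided by a parallel gain.

```python
def effective_speedup(
    n_tokens: int,
    per_token_ms: float,
    parallel_gain: float,
    overhead_ms: float,
) -> float:
    """Speedup of accelerated decoding over the sequential baseline.

    All parameters are hypothetical: `per_token_ms` is the baseline cost of
    one token, `parallel_gain` is the speedup with zero overhead, and
    `overhead_ms` is a fixed cost paid once per generation.
    """
    baseline_ms = n_tokens * per_token_ms
    accelerated_ms = overhead_ms + baseline_ms / parallel_gain
    return baseline_ms / accelerated_ms


# Same hypothetical settings, short versus long output.
short = effective_speedup(n_tokens=50, per_token_ms=20.0, parallel_gain=5.0, overhead_ms=100.0)
long = effective_speedup(n_tokens=2000, per_token_ms=20.0, parallel_gain=5.0, overhead_ms=100.0)
print(f"short output: {short:.2f}x, long output: {long:.2f}x")
```

Under this model the long generation approaches the zero-overhead gain while the short one falls well below it, which matches the qualitative behavior described above.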

Measured Speedups and Trade-offs

The released models list the following average speedups on the project’s evaluation set:

Orthrus-Qwen3-1.7B: 4.25×
Orthrus-Qwen3-4B: 5.20×
Orthrus-Qwen3-8B: 5.36×

Peak observed speedup reaches 7.8× on longer sequences. Memory usage stays comparable to the base Qwen3 models because the diffusion head reuses existing layers. The main cost is an extra forward pass per diffusion step, which the authors mitigate by limiting the number of diffusion iterations.

Current limitations include the lack of native vLLM or SGLang integration, so multi-user serving still requires custom scheduling. The repository states that these integrations are planned but not yet available. FlashAttention-2 is required for the reported numbers; switching to eager attention reduces the speedup by roughly 30% on an A100.
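As a rough worked example, the eager-attention penalty can be applied to the listed averages. This assumes the roughly 30% reduction applies uniformly to every model size, which the source does not state; the function name is hypothetical.

```python
# Average speedups as listed for the released checkpoints.
REPORTED_SPEEDUPS = {
    "Orthrus-Qwen3-1.7B": 4.25,
    "Orthrus-Qwen3-4B": 5.20,
    "Orthrus-Qwen3-8B": 5.36,
}


def estimate_eager_speedup(reported: float, drop_fraction: float = 0.30) -> float:
    """Scale a reported speedup down by an assumed relative drop."""
    return reported * (1.0 - drop_fraction)


for name, speedup in REPORTED_SPEEDUPS.items():
    print(f"{name}: about {estimate_eager_speedup(speedup):.1f}x with eager attention")
```

Even with the penalty, the estimates stay well above 1×, so eager attention slows the accelerated path without cancelling the gain under this assumption.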

Installation and Basic Usage

From a clone of the repository, install the package and its build dependencies with uv:

uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation
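After installation, a quick check can confirm that flash-attn is importable before loading a model. The helper below is a hypothetical convenience, not part of the package; it falls back to eager attention, which runs but comes with the lower speedup noted earlier.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return an attn_implementation value based on what is installed.

    Uses "flash_attention_2" when the flash_attn package can be found,
    otherwise "eager". The result can be passed to from_pretrained().
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


print(pick_attn_implementation())
```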

A minimal generation script loads the 8B checkpoint and enables diffusion mode:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the 8B checkpoint in bfloat16 on the GPU. trust_remote_code lets
# transformers run the custom modeling code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

# Build a chat-formatted prompt and tokenize it.
prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Generate with diffusion mode enabled, streaming tokens as they arrive.
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    use_diffusion_mode=True,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

The use_diffusion_mode flag switches the generation loop to the dual-view scheduler. All other generation parameters such as temperature or top-p continue to function as usual.
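To check the losslessness claim on your own prompts, one option is to generate twice and compare token IDs. The helper below is a hypothetical utility that works on plain sequences of IDs. The usage in the comments assumes greedy decoding (do_sample=False) so both runs are deterministic, and assumes that use_diffusion_mode=False selects standard decoding, which the source does not confirm.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter
    sequence.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical usage with the script above (not run here):
#   fast = model.generate(input_ids=input_ids, max_new_tokens=256,
#                         do_sample=False, use_diffusion_mode=True)
#   slow = model.generate(input_ids=input_ids, max_new_tokens=256,
#                         do_sample=False, use_diffusion_mode=False)
#   print(first_divergence(fast[0].tolist(), slow[0].tolist()))
```

A result of None means the two runs agree token for token; an integer points at the first position worth inspecting.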

FAQs

Does Orthrus change model outputs compared with the base Qwen3 checkpoint? No. The consensus step enforces exact agreement with the autoregressive distribution, so generated text matches the original model token for token.

Can I use Orthrus checkpoints inside an existing vLLM deployment today? Not yet. The repository notes that vLLM and SGLang support are scheduled but currently absent, requiring direct Hugging Face loading.

What hardware is required to match the published speedups? Results were measured on NVIDIA A100 GPUs with FlashAttention-2 enabled. Lower-end cards can still run the models but see smaller relative gains because the diffusion overhead becomes more noticeable.
