Hanzo Engine

Production AI inference at any scale.


| GitHub | Documentation | Rust SDK | Python SDK | Discord |

High-performance cloud inference engine serving Zen models and 60+ model architectures. CUDA, Metal, and CPU backends with paged attention, continuous batching, speculative decoding, and tensor parallelism. Built in Rust on Hanzo ML.

  • Multimodal: text, vision, audio, speech, image generation, embeddings, agents
  • APIs: Rust SDK, Python SDK, OpenAI-compatible HTTP server, MCP server and client
  • Performance: PagedAttention, FlashAttention V2/V3, in-situ quantization, per-layer topology
  • Scale: Multi-GPU via NCCL, multi-node via TCP ring, continuous batching
  • Ecosystem: integrates with Hanzo Edge, Hanzo Gateway, Hanzo Cloud, and Hanzo MCP

For full documentation, see docs.hanzo.ai/docs/services/engine.


Quick Start

Install

Linux / macOS:

curl -sSL https://engine.hanzo.ai/install.sh | sh

Windows (PowerShell):

irm https://engine.hanzo.ai/install.ps1 | iex

Via Cargo:

cargo install hanzo-engine

Manual installation & other platforms

Run Your First Model

# Interactive chat with a Zen model
hanzo-engine run -m zenlm/zen4-mini

# Start an OpenAI-compatible server with web UI
hanzo-engine serve --ui -m zenlm/zen4-mini --port 8000

# Serve any Hugging Face model
hanzo-engine serve -m google/gemma-3-4b-it --port 8000

Visit http://localhost:8000/ui for the web chat interface.


Docker

Pre-built images are published to ghcr.io/hanzoai/engine for every release.

| Tag | Backend | Use Case |
|-----|---------|----------|
| latest | CPU | Development, CI, ARM64 servers |
| cuda | NVIDIA CUDA | Production GPU serving |
| cuda-&lt;version&gt; | NVIDIA CUDA (pinned) | Reproducible deployments |
| metal | Apple Metal | macOS GPU serving |

Quick Run

# CPU -- good for small models and testing
docker run -p 8000:8000 \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:latest \
  serve -m zenlm/zen4-mini --port 8000

# NVIDIA GPU -- production serving
docker run -p 8000:8000 --gpus all \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4 --port 8000

# NVIDIA GPU with quantization -- reduce VRAM usage
docker run -p 8000:8000 --gpus all \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4 --isq Q4K --port 8000

# Apple Silicon (Metal)
docker run -p 8000:8000 \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:metal \
  serve -m zenlm/zen4-mini --port 8000

# Multi-GPU with tensor parallelism
docker run -p 8000:8000 --gpus all \
  -e HANZO_ENGINE_LOCAL_WORLD_SIZE=4 \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4-max --port 8000

# With HuggingFace token for gated models
docker run -p 8000:8000 --gpus all \
  -e HF_TOKEN=hf_your_token_here \
  -v hanzo-models:/root/.cache/huggingface \
  ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4 --port 8000

Docker Compose

services:
  engine:
    image: ghcr.io/hanzoai/engine:cuda
    ports:
      - "8000:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - HANZO_ENGINE_LOCAL_WORLD_SIZE=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.cache/huggingface
    command: serve -m zenlm/zen4 --port 8000
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s

volumes:
  model-cache:

Kubernetes Deployment

Basic Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
  labels:
    app: hanzo-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: engine
          image: ghcr.io/hanzoai/engine:cuda
          command: ["hanzo-engine", "serve", "-m", "zenlm/zen4", "--port", "8000"]
          ports:
            - name: http
              containerPort: 8000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hanzo-engine-secrets
                  key: hf-token
            - name: HANZO_ENGINE_LOCAL_WORLD_SIZE
              value: "1"
          resources:
            requests:
              memory: "32Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "64Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP

Horizontal Pod Autoscaler
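
The example below scales on a custom per-pod metric; for the HPA to read engine_requests_in_flight, a custom-metrics adapter (for example, Prometheus Adapter) must expose the engine's metrics through the Kubernetes custom metrics API.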

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hanzo-engine-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hanzo-engine
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: engine_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"

Multi-GPU StatefulSet

For large models requiring tensor parallelism across multiple GPUs:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hanzo-engine-multi-gpu
spec:
  serviceName: hanzo-engine-multi-gpu
  replicas: 1
  selector:
    matchLabels:
      app: hanzo-engine-multi-gpu
  template:
    metadata:
      labels:
        app: hanzo-engine-multi-gpu
    spec:
      nodeSelector:
        nvidia.com/gpu.count: "8"
      containers:
        - name: engine
          image: ghcr.io/hanzoai/engine:cuda
          command: ["hanzo-engine", "serve", "-m", "zenlm/zen4-max", "--port", "8000"]
          env:
            - name: HANZO_ENGINE_LOCAL_WORLD_SIZE
              value: "8"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: "640Gi"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi

Zen Models

Hanzo Engine is the inference backend for the Zen model family, with first-class support for all 14 production Zen models.

| Model | Parameters | Active Params | Context | Architecture | Modality | Use Case |
|-------|------------|---------------|---------|--------------|----------|----------|
| zen4 | 744B MoE | 40B | 202K | Transformer MoE | Text | Flagship reasoning and generation |
| zen4-max | 1.04T MoE | 32B | 256K | Transformer MoE | Text | Maximum capability, longest context |
| zen4-ultra | 744B MoE + CoT | 40B | 202K | Transformer MoE | Text | Extended chain-of-thought reasoning |
| zen4-pro | 80B MoE | 3B | 131K | Transformer MoE | Text | High quality, efficient serving |
| zen4-mini | 8B dense | 8B | 40K | Transformer | Text | Fast inference, edge deployment |
| zen4-coder | 480B MoE | 35B | 262K | Transformer MoE | Code | Code generation and analysis |
| zen4-coder-flash | 30B MoE | 3B | 262K | Transformer MoE | Code | Fast code completion |
| zen4-coder-pro | 480B dense BF16 | 480B | 262K | Transformer | Code | Maximum code quality |
| zen3-vl | 30B MoE | 3B | 131K | Vision-Language MoE | Text + Vision | Multimodal understanding |
| zen3-omni | ~200B | ~200B | 202K | Multimodal Transformer | Text + Vision + Audio | Unified multimodal |
| zen3-nano | 4B dense | 4B | 40K | Transformer | Text | Ultra-lightweight, embedded |
| zen3-guard | 4B dense | 4B | - | Classifier | Safety | Content filtering, guardrails |
| zen3-embedding | - | - | - | Embedding (3072-dim) | Embedding | Search, retrieval, RAG |
| zen-agent | - | - | - | Agent Framework | Agents | Autonomous tool use, planning |

# Serve any Zen model
hanzo-engine serve -m zenlm/zen4-mini --port 8000
hanzo-engine serve -m zenlm/zen4 --port 8000 --isq Q4K
hanzo-engine serve -m zenlm/zen4-coder --port 8000
hanzo-engine serve -m zenlm/zen4-max --port 8000

# Vision model
hanzo-engine serve -m zenlm/zen3-vl --port 8000

# Embedding model
hanzo-engine serve -m zenlm/zen3-embedding --port 8000

See all Zen model weights at @zenlm on HuggingFace.


Performance Features

| Category | Feature | Description |
|----------|---------|-------------|
| Attention | PagedAttention | High-throughput KV cache management on CUDA and Metal with prefix caching |
| Attention | FlashAttention V2 | Memory-efficient attention for Ampere+ GPUs (CC >= 8.0) |
| Attention | FlashAttention V3 | Optimized attention for Hopper GPUs (CC >= 9.0, H100/H200) |
| Batching | Continuous Batching | Dynamic request batching across all backends, enabled by default |
| Batching | Prompt Caching | Block-level prefix caching across requests sharing common prefixes |
| Decoding | Speculative Decoding | Draft-model acceleration with rejection sampling for 2-3x speedup |
| Decoding | Constrained Decoding | Regex, Lark grammar, JSON schema, llguidance |
| Memory | KV Cache Quantization | FP8 (E4M3) cache compression, halving memory with minimal quality loss |
| Quantization | ISQ | In-situ quantization of any HuggingFace model at load time |
| Quantization | GGUF | Pre-quantized 2-8 bit model loading |
| Quantization | GPTQ / AWQ / HQQ | Pre-quantized model formats |
| Quantization | FP8 / BNB | 8-bit and bitsandbytes quantization |
| Quantization | Per-Layer Topology | Fine-tune quantization per layer for optimal quality/speed tradeoff |
| Quantization | UQFF | Universal Quantized File Format for portable quantized models |
| Parallelism | NCCL Tensor Parallelism | Multi-GPU on NVIDIA (recommended for CUDA) |
| Parallelism | Ring Tensor Parallelism | TCP-based, cross-device, cross-machine (Metal + CUDA + CPU) |
| Adapters | LoRA / X-LoRA | Runtime adapter loading with weight merging |
| Adapters | AnyMoE | Create mixture-of-experts on any base model |
| Serving | Multi-Model | Load and unload models at runtime via API |
| Serving | Auto-Detection | Automatically detects architecture, quantization, and chat template |

Compute Backends

| Backend | Platform | Hardware |
|---------|----------|----------|
| CUDA | Linux, Windows (WSL) | NVIDIA GPUs (all generations) |
| cuDNN | Linux, Windows (WSL) | NVIDIA GPUs (optimized primitives) |
| Metal | macOS | Apple Silicon, AMD GPUs |
| Accelerate | macOS | Apple CPU optimization |
| MKL | Linux, Windows | Intel CPU optimization |
| CPU | All platforms | Any x86_64 or ARM64 processor |

API Reference

Hanzo Engine exposes a fully OpenAI-compatible HTTP API. Interactive docs are available at http://localhost:<port>/docs (Swagger UI).

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/chat/completions | Chat completion (streaming and non-streaming) |
| POST | /v1/completions | Text completion |
| POST | /v1/embeddings | Generate embeddings |
| POST | /v1/images/generations | Image generation (FLUX models) |
| GET | /v1/models | List loaded models |
| POST | /v1/models/load | Load a model at runtime |
| POST | /v1/models/unload | Unload a model at runtime |
| GET | /health | Health check (returns 200 when ready) |
| GET | /docs | Interactive API documentation (Swagger UI) |
| GET | /ui | Built-in web chat interface (when --ui is enabled) |
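
Health and Model Listing

The health and model-listing endpoints take no request body, which makes them convenient for probes and quick checks:

# Returns HTTP 200 once the engine is ready to serve requests
curl -f http://localhost:8000/health

# List the models currently loaded (OpenAI-compatible format)
curl http://localhost:8000/v1/models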

Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Embeddings

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": ["Search query", "Document passage to embed"]
  }'

Image Generation

curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "prompt": "A futuristic cityscape at sunset, photorealistic",
    "n": 1,
    "size": "1024x1024"
  }'

Extended Parameters

Hanzo Engine extends the OpenAI API with additional parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| top_k | int | Top-K sampling |
| min_p | float | Minimum probability threshold |
| grammar | object | Constrained decoding (regex, Lark, JSON schema, llguidance) |
| enable_thinking | bool | Enable chain-of-thought for supported models (zen4-ultra) |
| web_search_options | object | Enable web search integration |
| reasoning_effort | string | Control reasoning depth: low, medium, high |
| repetition_penalty | float | Multiplicative penalty for repeated tokens |
| truncate_sequence | bool | Truncate overlong prompts instead of rejecting |
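
Constrained Decoding

As a sketch of the grammar parameter, the request below constrains the response to a JSON schema. The exact shape of the grammar object shown here is an assumption; consult the HTTP API documentation for the authoritative format.

# NOTE: the grammar payload below is illustrative, not the canonical schema
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Name a city and its country."}],
    "grammar": {
      "type": "json_schema",
      "value": {
        "type": "object",
        "properties": {
          "city": {"type": "string"},
          "country": {"type": "string"}
        },
        "required": ["city", "country"]
      }
    }
  }'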

Full HTTP API documentation


Python SDK

Installation

pip install hanzo-engine               # CPU
pip install hanzo-engine-cuda          # NVIDIA GPU
pip install hanzo-engine-metal         # Apple Silicon
pip install hanzo-engine-mkl           # Intel CPU

OpenAI SDK (recommended for HTTP server)

Any OpenAI-compatible client works out of the box:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

# Basic chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Write a haiku about Rust."}
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain paged attention."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Vision (multimodal)

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)

Embeddings

response = client.embeddings.create(
    model="default",
    input=["Search query", "Document to embed"],
)
for item in response.data:
    print(f"Embedding dimension: {len(item.embedding)}")

Tool Calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].message.tool_calls:
    for call in response.choices[0].message.tool_calls:
        print(f"Function: {call.function.name}")
        print(f"Arguments: {call.function.arguments}")

Native Python Bindings

For direct in-process inference without an HTTP server:

from hanzo_engine import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="zenlm/zen4-mini"),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(response.choices[0].message.content)

Python SDK docs | Installation | Examples | Cookbook


Rust SDK

cargo add hanzo-engine

use anyhow::Result;
use hanzo_engine::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("zenlm/zen4-mini")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello!");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    Ok(())
}

Rust API docs | Examples


CLI Reference

The hanzo-engine CLI is designed to be zero-config: point it at a model and go.

# Interactive chat
hanzo-engine run -m zenlm/zen4-mini

# HTTP server with web UI
hanzo-engine serve --ui -m zenlm/zen4 --port 8000

# Auto-tune for your hardware
hanzo-engine tune -m zenlm/zen4-mini --emit-config config.toml

# Run from generated config
hanzo-engine from-config -f config.toml

# Benchmark a model
hanzo-engine bench -m zenlm/zen4-mini

# Generate quantized UQFF file
hanzo-engine quantize -m zenlm/zen4 --isq Q4K -o zen4-q4k.uqff

# System diagnostics
hanzo-engine doctor

# Manage model cache
hanzo-engine cache list
hanzo-engine cache clean

Commands

| Command | Description |
|---------|-------------|
| run | Interactive chat mode |
| serve | Start OpenAI-compatible HTTP/MCP server |
| from-config | Run from TOML configuration file |
| quantize | Generate UQFF quantized model file |
| tune | Auto-benchmark and recommend settings for your hardware |
| doctor | System diagnostics (CUDA, Metal, HuggingFace connectivity) |
| login | Authenticate with HuggingFace Hub |
| cache | Manage the model cache |
| bench | Performance benchmarking |

Full CLI documentation


Supported Architectures

Hanzo Engine supports 60+ model architectures across five modalities.

Text Models

| Architecture | GGUF | ISQ | LoRA | AnyMoE |
|--------------|------|-----|------|--------|
| Llama (1/2/3/3.1/3.3) | Yes | Yes | Yes | Yes |
| Mistral (7B/Nemo) | Yes | Yes | Yes | Yes |
| Mixtral | Yes | Yes | Yes | - |
| Qwen 2 | - | Yes | - | Yes |
| Qwen 3 | Yes | Yes | - | - |
| Qwen 3 MoE | - | Yes | - | - |
| Qwen 3 Next | - | Yes | - | - |
| Gemma | - | Yes | Yes | Yes |
| Gemma 2 | - | Yes | Yes | Yes |
| Phi 2 | Yes | Yes | Yes | Yes |
| Phi 3 | Yes | Yes | Yes | Yes |
| Phi 3.5 MoE | - | Yes | - | - |
| Starcoder 2 | - | Yes | Yes | Yes |
| DeepSeek V2 | - | Yes | - | - |
| DeepSeek V3 | - | Yes | - | - |
| GLM 4 | - | Yes | Yes | - |
| GLM-4.7-Flash (MoE) | - | Yes | - | - |
| GLM-4.7 (MoE) | - | Yes | - | - |
| SmolLM 3 | - | Yes | Yes | Yes |
| Granite 4.0 | - | Yes | - | - |
| GPT-OSS | - | Yes | - | - |

Vision Models

| Architecture | ISQ | LoRA | AnyMoE |
|--------------|-----|------|--------|
| Qwen 3-VL | Yes | - | - |
| Qwen 3-VL MoE | Yes | - | - |
| Qwen 2.5-VL | Yes | - | - |
| Qwen 2-VL | Yes | - | - |
| Gemma 3 | Yes | - | Yes |
| Gemma 3n | Yes | - | - |
| Llama 4 | Yes | - | - |
| Llama 3.2 Vision | Yes | - | - |
| Mistral 3 | Yes | - | Yes |
| Phi 3V | Yes | - | - |
| Phi 4 Multimodal | Yes | - | - |
| MiniCPM-O 2.6 | Yes | - | - |
| Idefics 2 | Yes | - | - |
| Idefics 3 | Yes | - | Yes |
| LLaVA | Yes | - | Yes |
| LLaVA Next | Yes | - | Yes |

Speech Models

| Architecture | ISQ | Description |
|--------------|-----|-------------|
| Voxtral | Yes | ASR / speech-to-text |
| Dia | Yes | Text-to-speech |

Image Generation Models

| Architecture | Description |
|--------------|-------------|
| FLUX | High-quality image generation |

Embedding Models

| Architecture | ISQ | Description |
|--------------|-----|-------------|
| Embedding Gemma | Yes | Text embeddings |
| Qwen 3 Embedding | Yes | Text embeddings |

Deployment

Multi-GPU (NCCL)

# 2x GPU tensor parallelism
HANZO_ENGINE_LOCAL_WORLD_SIZE=2 hanzo-engine serve \
  -m zenlm/zen4 --port 8000

# 4x GPU
HANZO_ENGINE_LOCAL_WORLD_SIZE=4 hanzo-engine serve \
  -m zenlm/zen4-max --port 8000

Multi-Node (Ring)

# Node 0 (master)
RING_CONFIG=ring_node0.json hanzo-engine serve -m zenlm/zen4 --port 8000

# Node 1
RING_CONFIG=ring_node1.json hanzo-engine serve -m zenlm/zen4 --port 8001

Ring config example (ring_node0.json):

{
  "master_ip": "0.0.0.0",
  "master_port": 1234,
  "port": 12345,
  "right_port": 12346,
  "rank": 0,
  "world_size": 2
}

Ring supports heterogeneous setups: mix Metal, CUDA, and CPU nodes in a single inference cluster.

Distributed inference docs


Performance

Benchmarks below were measured on representative hardware with continuous batching enabled.

| Model | Hardware | Quantization | Throughput (tok/s) | Latency (TTFT) | Memory |
|-------|----------|--------------|--------------------|----------------|--------|
| zen4-mini (8B) | 1x A100 80GB | FP16 | ~2,400 | 28ms | 16 GB |
| zen4-mini (8B) | 1x A100 80GB | Q4K ISQ | ~3,800 | 18ms | 5 GB |
| zen4-mini (8B) | M3 Max 64GB | Metal | ~85 | 120ms | 16 GB |
| zen4-mini (8B) | M3 Max 64GB | Q4K ISQ | ~110 | 80ms | 5 GB |
| zen4-pro (80B MoE) | 1x A100 80GB | Q4K ISQ | ~950 | 65ms | 42 GB |
| zen4 (744B MoE) | 4x H100 | FP8 + NCCL | ~1,200 | 180ms | 280 GB |
| zen4 (744B MoE) | 8x A100 80GB | Q4K + NCCL | ~800 | 250ms | 320 GB |

Run your own benchmarks:

hanzo-engine bench -m zenlm/zen4-mini --isq Q4K
hanzo-engine tune -m zenlm/zen4-mini --emit-config optimal.toml

Building from Source

git clone https://github.com/hanzoai/engine.git
cd engine

Feature Flags

| Feature | Description | Requires |
|---------|-------------|----------|
| cuda | NVIDIA GPU acceleration | CUDA toolkit |
| cudnn | cuDNN optimized primitives | CUDA + cuDNN |
| flash-attn | FlashAttention V2 | CUDA, CC >= 8.0 (Ampere+) |
| flash-attn-v3 | FlashAttention V3 | CUDA, CC >= 9.0 (Hopper) |
| metal | Apple Metal GPU | macOS |
| accelerate | Apple CPU optimization | macOS |
| mkl | Intel MKL CPU optimization | Intel MKL |
| nccl | Multi-GPU tensor parallelism | CUDA + NCCL |
| ring | Multi-node ring topology | TCP networking |

Build by Hardware

# NVIDIA GPU (Ampere or newer, recommended)
cargo build --release --features "cuda cudnn flash-attn"

# NVIDIA Hopper (H100)
cargo build --release --features "cuda cudnn flash-attn-v3"

# NVIDIA Multi-GPU
cargo build --release --features "cuda cudnn flash-attn nccl"

# Apple Silicon
cargo build --release --features "metal accelerate"

# Intel CPU
cargo build --release --features "mkl"

# CPU only (no features needed)
cargo build --release

Requirements

  • Rust 1.88+
  • For CUDA: NVIDIA CUDA toolkit, nvcc in PATH
  • For Flash Attention V2: GPU compute capability >= 8.0
  • For Flash Attention V3: GPU compute capability >= 9.0
  • For MKL: Intel oneAPI or standalone MKL installation
  • For NCCL: NVIDIA NCCL library

Full cargo features reference


Configuration

Environment Variables

| Variable | Description |
|----------|-------------|
| HANZO_ENGINE_LOCAL_WORLD_SIZE | Number of GPUs for NCCL tensor parallelism |
| HANZO_ENGINE_NO_NCCL=1 | Disable NCCL, use device mapping instead |
| RING_CONFIG | Path to ring topology JSON config |
| KEEP_ALIVE_INTERVAL | SSE keep-alive interval in ms |
| HF_TOKEN | HuggingFace Hub authentication token |
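
For example, a single invocation combining several of these variables (the values are illustrative):

# Serve a gated model on 2 GPUs with a 2-second SSE keep-alive
HF_TOKEN=hf_your_token_here \
KEEP_ALIVE_INTERVAL=2000 \
HANZO_ENGINE_LOCAL_WORLD_SIZE=2 \
hanzo-engine serve -m zenlm/zen4 --port 8000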

TOML Configuration

For complex setups, use a TOML config file:

[model]
model_id = "zenlm/zen4"
isq = "Q4K"

[server]
port = 8000
log = "info"

[paged_attention]
gpu_mem_fraction = 0.9
block_size = 32
cache_type = "f8e4m3"

[speculative]
draft_model = "zenlm/zen4-mini"
gamma = 16

hanzo-engine from-config -f config.toml

Configuration reference


Documentation

| Topic | Link |
|-------|------|
| Full Documentation | docs.hanzo.ai/docs/services/engine |
| CLI Reference | docs/CLI.md |
| HTTP API | docs/HTTP.md |
| Quantization Guide | docs/QUANTS.md |
| ISQ (In-Situ Quantization) | docs/ISQ.md |
| PagedAttention | docs/PAGED_ATTENTION.md |
| FlashAttention | docs/FLASH_ATTENTION.md |
| Speculative Decoding | docs/SPECULATIVE_DECODING.md |
| Distributed Inference | docs/DISTRIBUTED/DISTRIBUTED.md |
| Device Mapping | docs/DEVICE_MAPPING.md |
| Per-Layer Topology | docs/TOPOLOGY.md |
| LoRA & X-LoRA | docs/ADAPTER_MODELS.md |
| AnyMoE | docs/ANYMOE.md |
| Tool Calling | docs/TOOL_CALLING.md |
| Web Search | docs/WEB_SEARCH.md |
| MCP Integration | docs/MCP/README.md |
| Cargo Features | docs/CARGO_FEATURES.md |
| Python SDK | docs/PYTHON_SDK.md |
| Rust SDK | docs/RUST_SDK.md |
| Troubleshooting | docs/TROUBLESHOOTING.md |
| Configuration | docs/CONFIGURATION.md |

Related Projects

| Project | Description |
|---------|-------------|
| Hanzo Edge | On-device inference for mobile, web, and embedded (WASM, Metal, CPU) |
| Hanzo Gateway | API gateway with rate limiting, auth, and circuit breakers |
| Hanzo Ingress | L7 reverse proxy with automatic TLS and Kubernetes-native routing |
| Hanzo ML | Rust ML framework (tensor ops, neural networks, GPU kernels) |
| Hanzo Cloud | Cloud API gateway for AI inference |
| Hanzo LLM Gateway | Unified proxy for 100+ LLM providers |
| Hanzo Node | Decentralized compute node for AI workloads |
| Hanzo MCP | Model Context Protocol tools (260+) |
| Zen Models | Zen model family documentation and weights |
| @zenlm | Zen model weights on HuggingFace |

License

MIT

Back to Top
