Shimmy Logo

The Lightweight OpenAI API Server

🔒 Local Inference Without Dependencies 🚀

License: MIT · Security · Crates.io · Downloads · Rust · GitHub Stars

💝 Sponsor this project

Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

💝 Support Shimmy's Growth

🚀 If Shimmy helps you, consider sponsoring — 100% of support goes to keeping it free forever.

  • $5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
  • $25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
  • $100/month: Corporate backer 🏢 - Logo placement + monthly office hours
  • $500/month: Infrastructure partner 🚀 - Direct support + roadmap input

🎯 Become a Sponsor | See our amazing sponsors 🙏


Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a 4.8MB single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work — locally, privately, and free.

Developer Tools

Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

# 1) Install + run
cargo install shimmy --features huggingface
shimmy serve &

# 2) See models and pick one
shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"REPLACE_WITH_MODEL_FROM_list",
        "messages":[{"role":"user","content":"Say hi in 5 words."}],
        "max_tokens":32
      }' | jq -r '.choices[0].message.content'

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed - just change the API endpoint:

  • Any OpenAI client: Python, Node.js, curl, etc.
  • Development applications: Compatible with standard SDKs
  • VSCode Extensions: Point to http://localhost:11435
  • Cursor Editor: Built-in OpenAI compatibility
  • Continue.dev: Drop-in model provider

Use with OpenAI SDKs

  • Node.js (openai v4)
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);
  • Python (openai>=1.0.0)
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)

print(resp.choices[0].message.content)

⚡ Zero Configuration Required

  • Automatically finds models from Hugging Face cache, Ollama, local dirs
  • Auto-allocates ports to avoid conflicts
  • Auto-detects LoRA adapters for specialized models
  • Just works - no config files, no setup wizards
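If a model lives outside the auto-discovered locations, the SHIMMY_BASE_GGUF environment variable (documented under "Get Models" below) points Shimmy at it directly. A minimal sketch, with a placeholder path:

# Point Shimmy at a specific GGUF file (placeholder path — use your own model)
export SHIMMY_BASE_GGUF=./models/your-model.gguf

# Refresh discovery and confirm the model shows up
shimmy discover
shimmy list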

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

  • 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
  • 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
  • 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
  • ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
  • 🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control
# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8

# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference

🎯 Perfect for Local Development

  • Privacy: Your code never leaves your machine
  • Cost: No API keys, no per-token billing
  • Speed: Local inference, sub-second responses
  • Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

🪟 Windows

# RECOMMENDED: Use pre-built binary (no build dependencies required)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe -o shimmy.exe

# OR: Install from source with MOE support
# First install build dependencies:
winget install LLVM.LLVM
# Then install shimmy with MOE:
cargo install shimmy --features moe

# For CUDA + MOE hybrid processing:
cargo install shimmy --features llama-cuda,moe

⚠️ Windows Notes:

  • Pre-built binary recommended to avoid build dependency issues
  • MSVC compatibility: Uses shimmy-llama-cpp-2 packages for better Windows support
  • If Windows Defender flags the binary, add an exclusion or use cargo install
  • For cargo install: Install LLVM first to resolve libclang.dll errors

🍎 macOS / 🐧 Linux

# Install from crates.io
cargo install shimmy --features huggingface

GPU Acceleration

Shimmy supports multiple GPU backends for accelerated inference:

🖥️ Available Backends

Backend       Hardware               Installation
CUDA          NVIDIA GPUs            cargo install shimmy --features llama-cuda
CUDA + MOE    NVIDIA GPUs + CPU      cargo install shimmy --features llama-cuda,moe
Vulkan        Cross-platform GPUs    cargo install shimmy --features llama-vulkan
OpenCL        AMD/Intel/Others       cargo install shimmy --features llama-opencl
MLX           Apple Silicon          cargo install shimmy --features mlx
MOE Hybrid    Any GPU + CPU          cargo install shimmy --features moe
All Features  Everything             cargo install shimmy --features gpu,moe

🔍 Check GPU Support

# Show detected GPU backends
shimmy gpu-info

⚡ Usage Notes

  • GPU backends are automatically detected at runtime
  • Falls back to CPU if GPU is unavailable
  • Multiple backends can be compiled in, best one selected automatically
  • Use --gpu-backend <backend> to force a specific backend (see the sketch below)
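A minimal sketch of pinning a backend at runtime. The backend identifier ("vulkan") is an assumption based on the table above; check the names your build actually reports with shimmy gpu-info:

# Compile in all GPU backends, then pick one at runtime
cargo install shimmy --features gpu

# Force a backend explicitly ("vulkan" is an assumed identifier; verify with shimmy gpu-info)
shimmy serve --gpu-backend vulkan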

Get Models

Shimmy auto-discovers models from:

  • Hugging Face cache: ~/.cache/huggingface/hub/
  • Ollama models: ~/.ollama/models/
  • Local directory: ./models/
  • Environment: SHIMMY_BASE_GGUF=path/to/model.gguf
# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/

Start Server

# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435

Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.
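A quick way to confirm the server is reachable is to hit the documented health and model-listing endpoints (adjust the port if Shimmy auto-allocated a different one):

# Health check
curl -s http://127.0.0.1:11435/health

# See which models the server is exposing
curl -s http://127.0.0.1:11435/v1/models | jq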

📦 Download & Install

Package Managers

Direct Downloads

  • GitHub Releases: Latest binaries
  • Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy

✅ Verified working:

  • Intel and Apple Silicon Macs
  • Metal GPU acceleration (automatic)
  • MLX native acceleration for Apple Silicon
  • Xcode 17+ compatibility
  • All LoRA adapter features

Integration Examples

VSCode Copilot

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy to keep privacy-first control over my own AI development and to keep things local and lean.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.

💡 Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month — less than your Netflix subscription, infinitely more useful for developers.

API Reference

Endpoints

  • GET /health - Health check
  • POST /v1/chat/completions - OpenAI-compatible chat
  • GET /v1/models - List available models
  • POST /api/generate - Shimmy native API
  • GET /ws/generate - WebSocket streaming
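Because /v1/chat/completions follows the OpenAI schema, streaming should work the same way as against OpenAI. This sketch assumes Shimmy honours the standard "stream": true flag and SSE response format:

# Assumes OpenAI-style SSE streaming; remove "stream" for a plain JSON response
curl -N http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "REPLACE_WITH_MODEL",
        "messages": [{"role": "user", "content": "Stream a haiku."}],
        "stream": true
      }'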

CLI Commands

shimmy serve                    # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080  # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8  # Enable MOE CPU offloading
shimmy list                     # Show available models (LLM-filtered)
shimmy discover                 # Refresh model discovery
shimmy generate --name X --prompt "Hi"  # Test generation
shimmy probe model-name         # Verify model loads
shimmy gpu-info                 # Show GPU backend status
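A typical first-run sequence using these commands (the model name is a placeholder for whatever shimmy list prints):

shimmy list                                    # pick a model name from the output
shimmy probe your-model-name                   # verify it loads
shimmy generate --name your-model-name --prompt "Hello from Shimmy"
shimmy serve                                   # then point your tools at the printed port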

Technical Architecture

  • Rust + Tokio: Memory-safe, async performance
  • llama.cpp backend: Industry-standard GGUF inference
  • OpenAI API compatibility: Drop-in replacement
  • Dynamic port management: Zero conflicts, auto-allocation
  • Zero-config auto-discovery: Just works™

🚀 Advanced Features

  • 🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
  • 🎯 Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
  • 🛡️ 6-Gate Release Validation: Constitutional quality limits ensure reliability
  • ⚡ Smart Model Preloading: Background loading with usage tracking for instant model switching
  • 💾 Response Caching: LRU + TTL cache delivering 20-40% performance gains on repeat queries
  • 🚀 Integration Templates: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
  • 🔄 Request Routing: Multi-instance support with health checking and load balancing
  • 📊 Advanced Observability: Real-time metrics with self-optimization and Prometheus integration
  • 🔗 RustChain Integration: Universal workflow transpilation and orchestration

Community & Support

Star History

Star History Chart

🚀 Momentum Snapshot

📦 Sub-5MB single binary (142x smaller than Ollama)
🌟 GitHub stars climbing fast
⚡ <1s startup
🦀 100% Rust, no Python

📰 As Featured On

🔥 Hacker News Front Page (again) · IPE Newsletter

Companies: Need invoicing? Email [email protected]

⚡ Performance Comparison

Tool       Binary Size  Startup Time  Memory Usage  OpenAI API
Shimmy     4.8MB        <100ms        50MB          100%
Ollama     680MB        5-10s         200MB+        Partial
llama.cpp  89MB         1-2s          100MB         Via llama-server

Quality & Reliability

Shimmy maintains high code quality through comprehensive testing:

  • Comprehensive test suite with property-based testing
  • Automated CI/CD pipeline with quality gates
  • Runtime invariant checking for critical operations
  • Cross-platform compatibility testing

Development Testing

Run the complete test suite:

# Using cargo aliases
cargo test-quick           # Quick development tests

# Using Makefile  
make test                  # Full test suite
make test-quick            # Quick development tests

See our testing approach for technical details.


License & Philosophy

MIT License - forever and always.

Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.

Testing Philosophy: Reliability through comprehensive validation and property-based testing.


Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local model inference simple and reliable