| Documentation | Technical Report |
- 2025-12-21: We provide day-0 support for high-performance inference of the GLM-4.7 model.
- 2025-12-08: We provide day-0 support for high-performance inference of the GLM-4.6V model.
- 2025-12-05: We now support high-performance inference for the GLM-4.5/GLM-4.6 series models.
- 2025-12-05: We now support high-performance inference for the VLM-R1 model.
- 2025-12-05: We have built hybrid KV cache management on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.
- 2025-10-16: We have released the xLLM Technical Report on arXiv, providing comprehensive technical blueprints and implementation insights.
xLLM is an efficient LLM inference framework optimized for Chinese AI accelerators, enabling enterprise-grade deployment with higher efficiency and lower cost. The framework adopts a service-engine decoupled inference architecture and achieves its efficiency through several technologies: at the service layer, elastic scheduling of online/offline requests, dynamic PD disaggregation, a hybrid EPD mechanism for multimodal workloads, and high-availability fault tolerance; at the engine layer, multi-stream parallel computing, graph fusion optimization, speculative inference, dynamic load balancing, and global KV cache management. The overall architecture is shown below:
xLLM already supports efficient deployment of mainstream large models (such as DeepSeek-V3.1, Qwen2/3, etc.) on Chinese AI accelerators, empowering enterprises to build high-performance, low-cost large-model applications. xLLM has been fully deployed in JD.com's core retail businesses, covering scenarios including intelligent customer service, risk control, supply chain optimization, ad recommendation, and more.
xLLM delivers robust intelligent computing capabilities. By leveraging hardware system optimization and algorithm-driven decision control, it jointly accelerates the inference process, enabling high-throughput, low-latency distributed inference services.
Full Graph Pipeline Execution Orchestration
- Asynchronous decoupled scheduling at the request scheduling layer to reduce computational bubbles (see the sketch below).
- Asynchronous parallelism at the model graph layer, overlapping computation and communication.
- Pipelining of heterogeneous computing units at the operator kernel layer, overlapping computation and memory access.
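As a rough illustration of the first point, the sketch below (plain C++ with hypothetical `prepare_next_batch`/`execute_batch` stand-ins, not xLLM's real scheduler API) shows how preparing batch N+1 asynchronously while batch N executes hides the CPU-side scheduling bubble behind device execution.

```cpp
// Minimal sketch (not xLLM's actual code): overlap batch preparation with
// batch execution so the accelerator is not left idle between decode steps.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

struct Batch { std::vector<int> request_ids; };

// Stand-ins for the real scheduler and model executor (hypothetical).
Batch prepare_next_batch(int step) {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));  // CPU-side scheduling work
    return Batch{{step}};
}

void execute_batch(const Batch& batch) {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // device-side forward pass
    std::cout << "executed batch with " << batch.request_ids.size() << " request(s)\n";
}

int main() {
    // Asynchronously schedule step N+1 while step N runs on the device,
    // hiding the scheduling "bubble" behind model execution.
    std::future<Batch> next = std::async(std::launch::async, prepare_next_batch, 0);
    for (int step = 1; step <= 4; ++step) {
        Batch current = next.get();
        next = std::async(std::launch::async, prepare_next_batch, step);
        execute_batch(current);
    }
    execute_batch(next.get());
    return 0;
}
```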
Graph Optimization for Dynamic Shapes
- Dynamic shape adaptation based on parameterization and multi-graph caching to enhance the flexibility of static graphs (see the sketch below).
- Controlled tensor memory pool to ensure address safety and reusability.
- Integration and adaptation of performance-critical custom operators (e.g., PageAttention, AllReduce).
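To make the multi-graph caching idea concrete, here is a minimal sketch (illustrative `CompiledGraph`/`GraphCache` types, not xLLM's real ones) that buckets the dynamic token count up to the next power of two and caches one static graph per bucket, so a bounded set of graphs serves arbitrary shapes.

```cpp
// Minimal sketch (illustrative names, not xLLM's real types): cache one
// captured static graph per shape bucket so dynamic batch/sequence sizes can
// reuse a small number of pre-built graphs.
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Stand-in for a captured/compiled device graph (hypothetical).
struct CompiledGraph {
    int64_t max_tokens;
    void run(int64_t actual_tokens) const {
        std::cout << "run graph padded to " << max_tokens
                  << " tokens for " << actual_tokens << " real tokens\n";
    }
};

class GraphCache {
public:
    // Round the dynamic token count up to the next power of two so that a
    // bounded set of static graphs covers all shapes.
    static int64_t bucket(int64_t tokens) {
        int64_t b = 1;
        while (b < tokens) b <<= 1;
        return b;
    }

    const CompiledGraph& get_or_capture(int64_t tokens) {
        const int64_t key = bucket(tokens);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // In a real engine this is where graph capture/compilation happens.
            it = cache_.emplace(key, CompiledGraph{key}).first;
        }
        return it->second;
    }

private:
    std::unordered_map<int64_t, CompiledGraph> cache_;
};

int main() {
    GraphCache cache;
    for (int64_t tokens : {7, 8, 13, 100}) {
        cache.get_or_capture(tokens).run(tokens);
    }
    return 0;
}
```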
Efficient Memory Optimization
- Mapping management between discrete physical memory and continuous virtual memory.
- On-demand memory allocation to reduce memory fragmentation (see the sketch below).
- Intelligent scheduling of memory pages to increase memory reusability.
- Adaptation of the corresponding operators for Chinese AI accelerators.
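The sketch below illustrates the on-demand, page-style allocation idea with a toy block pool (hypothetical `BlockPool` type, not the actual allocator): physical KV blocks are handed out lazily as a sequence grows and returned to a free list when it finishes, which keeps fragmentation low and enables reuse.

```cpp
// Minimal sketch (not the real allocator): a paged KV-cache block pool that
// hands out fixed-size physical blocks on demand and maps them to a
// sequence's logically contiguous token range, reducing fragmentation.
#include <cstdint>
#include <deque>
#include <iostream>
#include <stdexcept>
#include <unordered_map>
#include <vector>

class BlockPool {
public:
    BlockPool(int num_blocks, int tokens_per_block)
        : tokens_per_block_(tokens_per_block) {
        for (int i = 0; i < num_blocks; ++i) free_.push_back(i);
    }

    // Ensure `seq_id` has enough physical blocks for `num_tokens` logical tokens.
    void reserve(int64_t seq_id, int num_tokens) {
        auto& table = block_tables_[seq_id];
        const int needed = (num_tokens + tokens_per_block_ - 1) / tokens_per_block_;
        while (static_cast<int>(table.size()) < needed) {
            if (free_.empty()) throw std::runtime_error("KV cache exhausted");
            table.push_back(free_.front());   // on-demand physical allocation
            free_.pop_front();
        }
    }

    // Return a finished sequence's blocks to the pool for reuse.
    void release(int64_t seq_id) {
        for (int block : block_tables_[seq_id]) free_.push_back(block);
        block_tables_.erase(seq_id);
    }

    const std::vector<int>& block_table(int64_t seq_id) { return block_tables_[seq_id]; }

private:
    int tokens_per_block_;
    std::deque<int> free_;                                         // recyclable physical blocks
    std::unordered_map<int64_t, std::vector<int>> block_tables_;   // logical -> physical mapping
};

int main() {
    BlockPool pool(/*num_blocks=*/8, /*tokens_per_block=*/16);
    pool.reserve(/*seq_id=*/1, /*num_tokens=*/40);  // needs 3 blocks
    std::cout << "seq 1 uses " << pool.block_table(1).size() << " blocks\n";
    pool.release(1);                                // blocks return to the free list
    return 0;
}
```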
Global KV Cache Management
- Intelligent offloading and prefetching of KV blocks across hierarchical caches (see the sketch below).
- KV cache-centric distributed storage architecture.
- Intelligent KV routing among computing nodes.
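A minimal two-tier sketch of hierarchical KV management (toy `TieredKvCache`, assuming a prefix-hash key; not the Mooncake-based implementation): cold blocks are offloaded from the device tier to a host tier, and a host hit prefetches the block back instead of recomputing it.

```cpp
// Minimal sketch (hypothetical types): a two-tier KV cache where cold entries
// are offloaded from device memory to host memory and prefetched back on a
// prefix-cache hit, in the spirit of hierarchical KV management.
#include <cstdint>
#include <iostream>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

class TieredKvCache {
public:
    explicit TieredKvCache(size_t device_capacity) : device_capacity_(device_capacity) {}

    // Insert a KV block keyed by its prefix hash; evict LRU blocks to host.
    void put(const std::string& prefix_hash, int64_t block_id) {
        touch(prefix_hash, block_id);
        while (lru_.size() > device_capacity_) {
            auto victim = lru_.back();            // coldest block on device
            lru_.pop_back();
            device_.erase(victim.first);
            host_[victim.first] = victim.second;  // offload to host tier
        }
    }

    // Look up a block; on a host hit, prefetch it back to device memory.
    std::optional<int64_t> get(const std::string& prefix_hash) {
        if (auto it = device_.find(prefix_hash); it != device_.end()) {
            return it->second->second;
        }
        if (auto it = host_.find(prefix_hash); it != host_.end()) {
            int64_t block = it->second;
            host_.erase(it);
            put(prefix_hash, block);              // prefetch back to device
            return block;
        }
        return std::nullopt;                      // miss: recomputation needed
    }

private:
    using Entry = std::pair<std::string, int64_t>;
    void touch(const std::string& key, int64_t block) {
        if (auto it = device_.find(key); it != device_.end()) lru_.erase(it->second);
        lru_.emplace_front(key, block);
        device_[key] = lru_.begin();
    }

    size_t device_capacity_;
    std::list<Entry> lru_;                                        // device tier, MRU first
    std::unordered_map<std::string, std::list<Entry>::iterator> device_;
    std::unordered_map<std::string, int64_t> host_;               // host (DRAM/SSD) tier
};

int main() {
    TieredKvCache cache(/*device_capacity=*/2);
    cache.put("prefix-a", 0);
    cache.put("prefix-b", 1);
    cache.put("prefix-c", 2);                     // evicts "prefix-a" to host
    std::cout << "hit on prefix-a: " << cache.get("prefix-a").has_value() << "\n";
    return 0;
}
```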
Algorithm-driven Acceleration
- Speculative decoding optimization to improve efficiency through multi-core parallelism (see the sketch below).
- Dynamic load balancing of MoE experts to achieve efficient adjustment of expert distribution.
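For intuition on the speculative decoding loop, here is a toy sketch (greedy verification only, with stand-in draft and target models, not xLLM's implementation): the draft model proposes k tokens, the target verifies them, and the longest agreeing prefix is accepted so several tokens can be emitted per target step.

```cpp
// Minimal sketch (toy "models", greedy verification only): the accept/reject
// loop at the heart of speculative decoding. A cheap draft model proposes k
// tokens; the target model checks them and keeps the longest prefix it
// agrees with, so several tokens can be emitted per target step.
#include <iostream>
#include <vector>

using Token = int;

// Hypothetical stand-ins for the draft and target models.
Token draft_next(const std::vector<Token>& ctx) { return static_cast<Token>(ctx.size() % 7); }
Token target_next(const std::vector<Token>& ctx) { return static_cast<Token>(ctx.size() % 5); }

// One speculative step: propose `k` draft tokens, then verify against the target.
std::vector<Token> speculative_step(std::vector<Token> ctx, int k) {
    std::vector<Token> proposed;
    std::vector<Token> draft_ctx = ctx;
    for (int i = 0; i < k; ++i) {
        Token t = draft_next(draft_ctx);
        proposed.push_back(t);
        draft_ctx.push_back(t);
    }
    // Verify: accept draft tokens while the target model would have produced
    // the same token; on the first mismatch, take the target's token instead.
    for (Token t : proposed) {
        Token expected = target_next(ctx);
        ctx.push_back(expected);
        if (expected != t) break;               // rejection: stop accepting drafts
    }
    return ctx;
}

int main() {
    std::vector<Token> ctx = {1, 2, 3};
    for (int step = 0; step < 3; ++step) {
        ctx = speculative_step(ctx, /*k=*/4);
        std::cout << "context length after step " << step << ": " << ctx.size() << "\n";
    }
    return 0;
}
```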
Please refer to Quick Start for more details, and check the model support status in the Model Support List.
There are several ways you can contribute to xLLM:
- Reporting Issues (Bugs & Errors)
- Suggesting Enhancements
- Improving Documentation
  - Fork the repository
  - Add your changes to the documentation
  - Send your pull request
- Writing Code
  - Fork the repository
  - Create a new branch
  - Add your feature or improvement
  - Send your pull request
We appreciate all kinds of contributions! If you have questions about development, please check our documentation: Document
If you encounter any issues along the way, you are welcome to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack. In addition, we have official WeChat groups; you can scan the QR code below to join. Feel free to contact us!
This project was made possible thanks to the following open-source projects:
- ScaleLLM - xLLM draws inspiration from ScaleLLM's graph construction method and references its runtime execution.
- Mooncake - xLLM builds its hybrid KV cache management on Mooncake.
- brpc - xLLM builds its high-performance HTTP service on brpc.
- tokenizers-cpp - xLLM builds its C++ tokenizer on tokenizers-cpp.
- safetensors - xLLM relies on the C binding safetensors capability.
- Partial JSON Parser - xLLM's C++ partial JSON parser is implemented with insights from the Python and Go implementations.
- concurrentqueue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
- Flashinfer - High-performance NVIDIA GPU kernels.
Thanks to the following collaborating university laboratories:
- THU-MIG (School of Software, BNRist, Tsinghua University)
- USTC-Cloudlab (Cloud Computing Lab, University of Science and Technology of China)
- Beihang-HiPO (Beihang HiPO research group)
- PKU-DS-LAB (Data Structure Laboratory, Peking University)
- PKU-NetSys-LAB (NetSys Lab, Peking University)
Thanks to all the following developers who have contributed to xLLM.
If you find this repository helpful, please cite us:
@article{liu2025xllm,
  title={xLLM Technical Report},
  author={Liu, Tongxuan and Peng, Tao and Yang, Peijun and Zhao, Xiaoyang and Lu, Xiusheng and Huang, Weizhe and Liu, Zirui and Chen, Xiaoyu and Liang, Zhiwei and Xiong, Jun and others},
  journal={arXiv preprint arXiv:2510.14686},
  year={2025}
}


