| Documentation | Technical Report |
- 2025-12-21: We provide day-0 support for high-performance inference of the GLM-4.7 model.
- 2025-12-08: We provide day-0 support for high-performance inference of the GLM-4.6V model.
- 2025-12-05: We now support high-performance inference for the GLM-4.5/GLM-4.6 series models.
- 2025-12-05: We now support high-performance inference for the VLM-R1 model.
- 2025-12-05: We have built hybrid KV cache management on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.
- 2025-10-16: We have released the xLLM Technical Report on arXiv, providing comprehensive technical blueprints and implementation insights.
xLLM is an efficient LLM inference framework optimized for Chinese AI accelerators, enabling enterprise-grade deployment with higher efficiency and lower cost. The framework adopts a service-engine decoupled inference architecture and achieves its efficiency through several technologies: at the service layer, elastic scheduling of online/offline requests, dynamic PD disaggregation, a hybrid EPD mechanism for multimodal workloads, and high-availability fault tolerance; at the engine layer, multi-stream parallel computing, graph fusion optimization, speculative inference, dynamic load balancing, and global KV cache management. The overall architecture is shown below:
xLLM already supports efficient deployment of mainstream large models (such as DeepSeek-V3.1, Qwen2/3, etc.) on Chinese AI accelerators, empowering enterprises to build high-performance, low-cost large-model applications. xLLM has been fully deployed in JD.com's core retail businesses, covering scenarios including intelligent customer service, risk control, supply chain optimization, ad recommendation, and more.
xLLM delivers robust intelligent computing capabilities. By leveraging hardware system optimization and algorithm-driven decision control, it jointly accelerates the inference process, enabling high-throughput, low-latency distributed inference services.
Full Graph Pipeline Execution Orchestration
- Asynchronous decoupled scheduling at the request scheduling layer to reduce computational bubbles (see the sketch below).
- Asynchronous parallelism at the model graph layer, overlapping computation and communication.
- Pipelining of heterogeneous computing units at the operator kernel layer, overlapping computation and memory access.
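As a rough illustration of the first point, the sketch below (plain C++ with hypothetical `prepare_next_batch`/`execute_batch` stand-ins, not xLLM's real scheduler API) shows how preparing batch N+1 asynchronously while batch N executes hides the CPU-side scheduling bubble behind device execution.

```cpp
// Minimal sketch (not xLLM's actual code): overlap batch preparation with
// batch execution so the accelerator is not left idle between decode steps.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

struct Batch { std::vector<int> request_ids; };

// Stand-ins for the real scheduler and model executor (hypothetical).
Batch prepare_next_batch(int step) {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));  // CPU-side scheduling work
    return Batch{{step}};
}

void execute_batch(const Batch& batch) {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // device-side forward pass
    std::cout << "executed batch with " << batch.request_ids.size() << " request(s)\n";
}

int main() {
    // Asynchronously schedule step N+1 while step N runs on the device,
    // hiding the scheduling "bubble" behind model execution.
    std::future<Batch> next = std::async(std::launch::async, prepare_next_batch, 0);
    for (int step = 1; step <= 4; ++step) {
        Batch current = next.get();
        next = std::async(std::launch::async, prepare_next_batch, step);
        execute_batch(current);
    }
    execute_batch(next.get());
    return 0;
}
```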
Graph Optimization for Dynamic Shapes
- Dynamic shape adaptation based on parameterization and multi-graph caching to enhance the flexibility of static graphs (see the sketch below).
- Controlled tensor memory pool to ensure address safety and reusability.
- Integration and adaptation of performance-critical custom operators (e.g., PageAttention, AllReduce).
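To make the multi-graph caching idea concrete, here is a minimal sketch (illustrative `CompiledGraph`/`GraphCache` types, not xLLM's real ones) that buckets the dynamic token count up to the next power of two and caches one static graph per bucket, so a bounded set of graphs serves arbitrary shapes.

```cpp
// Minimal sketch (illustrative names, not xLLM's real types): cache one
// captured static graph per shape bucket so dynamic batch/sequence sizes can
// reuse a small number of pre-built graphs.
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Stand-in for a captured/compiled device graph (hypothetical).
struct CompiledGraph {
    int64_t max_tokens;
    void run(int64_t actual_tokens) const {
        std::cout << "run graph padded to " << max_tokens
                  << " tokens for " << actual_tokens << " real tokens\n";
    }
};

class GraphCache {
public:
    // Round the dynamic token count up to the next power of two so that a
    // bounded set of static graphs covers all shapes.
    static int64_t bucket(int64_t tokens) {
        int64_t b = 1;
        while (b < tokens) b <<= 1;
        return b;
    }

    const CompiledGraph& get_or_capture(int64_t tokens) {
        const int64_t key = bucket(tokens);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // In a real engine this is where graph capture/compilation happens.
            it = cache_.emplace(key, CompiledGraph{key}).first;
        }
        return it->second;
    }

private:
    std::unordered_map<int64_t, CompiledGraph> cache_;
};

int main() {
    GraphCache cache;
    for (int64_t tokens : {7, 8, 13, 100}) {
        cache.get_or_capture(tokens).run(tokens);
    }
    return 0;
}
```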
Efficient Memory Optimization
- Mapping management between discrete physical memory and continuous virtual memory.
- On-demand memory allocation to reduce memory fragmentation (see the sketch below).
- Intelligent scheduling of memory pages to increase memory reusability.
- Adaptation of the corresponding operators for Chinese AI accelerators.
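The sketch below illustrates the on-demand, page-style allocation idea with a toy block pool (hypothetical `BlockPool` type, not the actual allocator): physical KV blocks are handed out lazily as a sequence grows and returned to a free list when it finishes, which keeps fragmentation low and enables reuse.

```cpp
// Minimal sketch (not the real allocator): a paged KV-cache block pool that
// hands out fixed-size physical blocks on demand and maps them to a
// sequence's logically contiguous token range, reducing fragmentation.
#include <cstdint>
#include <deque>
#include <iostream>
#include <stdexcept>
#include <unordered_map>
#include <vector>

class BlockPool {
public:
    BlockPool(int num_blocks, int tokens_per_block)
        : tokens_per_block_(tokens_per_block) {
        for (int i = 0; i < num_blocks; ++i) free_.push_back(i);
    }

    // Ensure `seq_id` has enough physical blocks for `num_tokens` logical tokens.
    void reserve(int64_t seq_id, int num_tokens) {
        auto& table = block_tables_[seq_id];
        const int needed = (num_tokens + tokens_per_block_ - 1) / tokens_per_block_;
        while (static_cast<int>(table.size()) < needed) {
            if (free_.empty()) throw std::runtime_error("KV cache exhausted");
            table.push_back(free_.front());   // on-demand physical allocation
            free_.pop_front();
        }
    }

    // Return a finished sequence's blocks to the pool for reuse.
    void release(int64_t seq_id) {
        for (int block : block_tables_[seq_id]) free_.push_back(block);
        block_tables_.erase(seq_id);
    }

    const std::vector<int>& block_table(int64_t seq_id) { return block_tables_[seq_id]; }

private:
    int tokens_per_block_;
    std::deque<int> free_;                                         // recyclable physical blocks
    std::unordered_map<int64_t, std::vector<int>> block_tables_;   // logical -> physical mapping
};

int main() {
    BlockPool pool(/*num_blocks=*/8, /*tokens_per_block=*/16);
    pool.reserve(/*seq_id=*/1, /*num_tokens=*/40);  // needs 3 blocks
    std::cout << "seq 1 uses " << pool.block_table(1).size() << " blocks\n";
    pool.release(1);                                // blocks return to the free list
    return 0;
}
```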
Global KV Cache Management
- Intelligent offloading and prefetching of KV blocks across hierarchical caches (see the sketch below).
- KV cache-centric distributed storage architecture.
- Intelligent KV routing among computing nodes.
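A minimal two-tier sketch of hierarchical KV management (toy `TieredKvCache`, assuming a prefix-hash key; not the Mooncake-based implementation): cold blocks are offloaded from the device tier to a host tier, and a host hit prefetches the block back instead of recomputing it.

```cpp
// Minimal sketch (hypothetical types): a two-tier KV cache where cold entries
// are offloaded from device memory to host memory and prefetched back on a
// prefix-cache hit, in the spirit of hierarchical KV management.
#include <cstdint>
#include <iostream>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

class TieredKvCache {
public:
    explicit TieredKvCache(size_t device_capacity) : device_capacity_(device_capacity) {}

    // Insert a KV block keyed by its prefix hash; evict LRU blocks to host.
    void put(const std::string& prefix_hash, int64_t block_id) {
        touch(prefix_hash, block_id);
        while (lru_.size() > device_capacity_) {
            auto victim = lru_.back();            // coldest block on device
            lru_.pop_back();
            device_.erase(victim.first);
            host_[victim.first] = victim.second;  // offload to host tier
        }
    }

    // Look up a block; on a host hit, prefetch it back to device memory.
    std::optional<int64_t> get(const std::string& prefix_hash) {
        if (auto it = device_.find(prefix_hash); it != device_.end()) {
            return it->second->second;
        }
        if (auto it = host_.find(prefix_hash); it != host_.end()) {
            int64_t block = it->second;
            host_.erase(it);
            put(prefix_hash, block);              // prefetch back to device
            return block;
        }
        return std::nullopt;                      // miss: recomputation needed
    }

private:
    using Entry = std::pair<std::string, int64_t>;
    void touch(const std::string& key, int64_t block) {
        if (auto it = device_.find(key); it != device_.end()) lru_.erase(it->second);
        lru_.emplace_front(key, block);
        device_[key] = lru_.begin();
    }

    size_t device_capacity_;
    std::list<Entry> lru_;                                        // device tier, MRU first
    std::unordered_map<std::string, std::list<Entry>::iterator> device_;
    std::unordered_map<std::string, int64_t> host_;               // host (DRAM/SSD) tier
};

int main() {
    TieredKvCache cache(/*device_capacity=*/2);
    cache.put("prefix-a", 0);
    cache.put("prefix-b", 1);
    cache.put("prefix-c", 2);                     // evicts "prefix-a" to host
    std::cout << "hit on prefix-a: " << cache.get("prefix-a").has_value() << "\n";
    return 0;
}
```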
Algorithm-driven Acceleration
- Speculative decoding optimization to improve efficiency through multi-core parallelism (see the sketch below).
- Dynamic load balancing of MoE experts to achieve efficient adjustment of expert distribution.
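For intuition on the speculative decoding loop, here is a toy sketch (greedy verification only, with stand-in draft and target models, not xLLM's implementation): the draft model proposes k tokens, the target verifies them, and the longest agreeing prefix is accepted so several tokens can be emitted per target step.

```cpp
// Minimal sketch (toy "models", greedy verification only): the accept/reject
// loop at the heart of speculative decoding. A cheap draft model proposes k
// tokens; the target model checks them and keeps the longest prefix it
// agrees with, so several tokens can be emitted per target step.
#include <iostream>
#include <vector>

using Token = int;

// Hypothetical stand-ins for the draft and target models.
Token draft_next(const std::vector<Token>& ctx) { return static_cast<Token>(ctx.size() % 7); }
Token target_next(const std::vector<Token>& ctx) { return static_cast<Token>(ctx.size() % 5); }

// One speculative step: propose `k` draft tokens, then verify against the target.
std::vector<Token> speculative_step(std::vector<Token> ctx, int k) {
    std::vector<Token> proposed;
    std::vector<Token> draft_ctx = ctx;
    for (int i = 0; i < k; ++i) {
        Token t = draft_next(draft_ctx);
        proposed.push_back(t);
        draft_ctx.push_back(t);
    }
    // Verify: accept draft tokens while the target model would have produced
    // the same token; on the first mismatch, take the target's token instead.
    for (Token t : proposed) {
        Token expected = target_next(ctx);
        ctx.push_back(expected);
        if (expected != t) break;               // rejection: stop accepting drafts
    }
    return ctx;
}

int main() {
    std::vector<Token> ctx = {1, 2, 3};
    for (int step = 0; step < 3; ++step) {
        ctx = speculative_step(ctx, /*k=*/4);
        std::cout << "context length after step " << step << ": " << ctx.size() << "\n";
    }
    return 0;
}
```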
Please refer to Quick Start for more details, and check the model support status in the Model Support List.
There are several ways you can contribute to xLLM:
- Reporting Issues (Bugs & Errors)
- Suggesting Enhancements
- Improving Documentation
  - Fork the repository
  - Add your changes to the documentation
  - Send your pull request
- Writing Code
  - Fork the repository
  - Create a new branch
  - Add your feature or improvement
  - Send your pull request
We appreciate all kinds of contributions! If you have questions about development, please check our documentation: Document
If you encounter any issues along the way, you are welcome to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack. In addition, we have official WeChat groups; you can scan the QR code below to join. Feel free to contact us!
This project was made possible thanks to the following open-source projects:
- ScaleLLM - xLLM draws inspiration from ScaleLLM's graph construction method and references its runtime execution.
- Mooncake - xLLM builds its hybrid KV cache management on Mooncake.
- brpc - xLLM builds its high-performance HTTP service on brpc.
- tokenizers-cpp - xLLM builds its C++ tokenizer on tokenizers-cpp.
- safetensors - xLLM relies on the C binding safetensors capability.
- Partial JSON Parser - xLLM's C++ partial JSON parser is implemented with insights from the Python and Go implementations.
- concurrentqueue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
- Flashinfer - High-performance NVIDIA GPU kernels.
Thanks to the following collaborating university laboratories:
- THU-MIG (School of Software, BNRist, Tsinghua University)
- USTC-Cloudlab (Cloud Computing Lab, University of Science and Technology of China)
- Beihang-HiPO (Beihang HiPO research group)
- PKU-DS-LAB (Data Structure Laboratory, Peking University)
- PKU-NetSys-LAB (NetSys Lab, Peking University)
Thanks to all the following developers who have contributed to xLLM.
If you find this repository helpful, please cite us:
@article{liu2025xllm,
  title={xLLM Technical Report},
  author={Liu, Tongxuan and Peng, Tao and Yang, Peijun and Zhao, Xiaoyang and Lu, Xiusheng and Huang, Weizhe and Liu, Zirui and Chen, Xiaoyu and Liang, Zhiwei and Xiong, Jun and others},
  journal={arXiv preprint arXiv:2510.14686},
  year={2025}
}


