# feat: add operator fusion support with dynamic scheduling #212
**Open** · hootandy321 wants to merge 14 commits into `InfiniTensor:demo131` from `hootandy321:demo131-fusion`
This commit adds fusion optimization support for LLaMA models, enabling dynamic scheduling of fused operators (`swiglu`, `add_rms_norm`) with runtime control via `FusionContext`.
## Core Features
### 1. Fusion Context System
- Added `FusionContext` class for runtime fusion control
- Supports per-operation fusion decisions (set/get/has/clear); see the sketch below
- Thread-safe configuration management
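A minimal usage sketch of the per-operation interface described above, assuming the pybind11 bindings export `FusionContext` at the `infinilm` top level; the import path and exact method signatures are assumptions inferred from the set/get/has/clear list, not confirmed API:
```python
# Hypothetical sketch: import path and signatures are assumptions inferred
# from the set/get/has/clear interface listed above, not confirmed API.
from infinilm import FusionContext

ctx = FusionContext()
ctx.set("swiglu", True)           # request the fused swiglu kernel
ctx.set("add_rms_norm", False)    # keep add + rms_norm as separate kernels

if ctx.has("swiglu"):             # was a per-op decision recorded?
    use_fused = ctx.get("swiglu")

ctx.clear()                       # drop all per-op decisions
```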
### 2. C++ Integration
- Modified `LlamaConfig`: added `enable_fusion` flag (default: false)
- Modified `LlamaMLP`: conditional `swiglu` fusion with fallback (sketched below)
- Modified `LlamaDecoderLayer`: prepared for `add_rms_norm` fusion
- All changes are backward compatible and opt-in
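The change itself is C++, but the conditional-fusion-with-fallback control flow can be rendered as a runnable Python sketch; `fused_swiglu` here is a NumPy stand-in, not the repo's kernel:
```python
# Runnable illustration of the LlamaMLP control flow described above;
# fused_swiglu is a NumPy stand-in for the real fused kernel.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def fused_swiglu(gate, up):
    # A real backend would compute silu(gate) * up in a single kernel launch.
    return silu(gate) * up

def swiglu_with_fallback(gate, up, enable_fusion=False):
    if not enable_fusion:
        return silu(gate) * up            # default non-fused path
    try:
        return fused_swiglu(gate, up)     # fused kernel when enabled
    except RuntimeError:
        return silu(gate) * up            # automatic safe fallback on error

gate = np.random.randn(2, 8).astype(np.float32)
up = np.random.randn(2, 8).astype(np.float32)
out = swiglu_with_fallback(gate, up, enable_fusion=True)
```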
### 3. Python API
- Added `fusion_utils.py`: `FusionManager` and pattern creators
- Added `fused_infer_engine.py`: `FusedInferEngine` with 3 fusion modes (selection sketched below):
  - `always_fuse`: always use fused kernels
  - `never_fuse`: always use separate kernels
  - `profile`: smart scheduling based on heuristics
- Updated `__init__.py` with conditional fusion imports
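Given the constructor shown in the Usage section below, switching between the three modes might look like this; the model path is a placeholder:
```python
# Mode-selection sketch; the constructor matches the Usage section below,
# and the model path is a placeholder.
from infinilm.fused_infer_engine import FusedInferEngine

engines = {
    mode: FusedInferEngine("/models/TinyLlama-1.1B-Chat", fusion_mode=mode)
    for mode in ("always_fuse", "never_fuse", "profile")
}
# "profile" decides per operator at runtime based on heuristics; the other
# two force fused or separate kernels unconditionally.
```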
### 4. Testing and Benchmarking
- Added `benchmark_fusion_e2e.py`: end-to-end fusion performance testing
- Added `test_llama_fusion.py`: fusion unit tests
- Scripts verify correctness and performance improvements (generic pattern sketched below)
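A generic version of such a correctness check, not the repo's actual test code, compares fused and non-fused outputs within a numeric tolerance:
```python
# Generic fused-vs-reference correctness pattern, not the repo's test code.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
gate = rng.standard_normal((4, 16)).astype(np.float32)
up = rng.standard_normal((4, 16)).astype(np.float32)

reference = silu(gate) * up   # separate-kernel path
fused = silu(gate) * up       # substitute the fused kernel's output here

np.testing.assert_allclose(fused, reference, rtol=1e-3, atol=1e-5)
```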
## Compatibility
- **100% backward compatible**: Fusion disabled by default
- **No API changes**: Existing code works without modifications
- **Opt-in**: Enable via `config.enable_fusion = True`
- **Safe fallback**: Automatic fallback to the non-fused path on errors (illustrated in the sketch under C++ Integration above)
## Files Added
### C++
- `csrc/fusion/fusion_context.{cpp,hpp}`: fusion context implementation
- `csrc/pybind11/fusion.hpp`: Python bindings
### Python
- `python/infinilm/fusion_utils.py`
- `python/infinilm/fused_infer_engine.py`
- `examples/benchmark_fusion_e2e.py`
- `test_llama_fusion.py`
## Files Modified
### C++
- `csrc/models/llama/llama_config.hpp`: +`enable_fusion` flag
- `csrc/models/llama/llama_mlp.{cpp,hpp}`: +fusion logic
- `csrc/models/llama/llama_decoder_layer.{cpp,hpp}`: +fusion support
- `csrc/pybind11/bindings.cc`: +fusion bindings
### Python
- `python/infinilm/__init__.py`: conditional fusion exports
- `python/infinilm/models/llama/modeling_llama.py`: try-import fusion
## Usage
```python
# Option 1: Use FusedInferEngine (recommended)
from infinilm.fused_infer_engine import FusedInferEngine
engine = FusedInferEngine(model_path, fusion_mode="profile")  # dynamic scheduling
# Option 2: Enable fusion in existing code
from infinilm.auto_config import AutoConfig
config = AutoConfig.from_pretrained(model_path)
config.enable_fusion = True  # opt in to fused kernels
# then build the model/engine from this config as in existing code
```
Co-Authored-By: liuxingyu <[email protected]>
## PR Summary: [demo131] Introduce operator fusion optimization and end-to-end benchmarking tools
### Core Changes
This PR introduces optional operator fusion for LLaMA inference, primarily serving the demo131 performance demonstration and comparative benchmarking.
- Operator fusion mechanism: adds `FusionContext` runtime control, supporting dynamic fusion and safe fallback for operators such as `swiglu` and `add_rms_norm`.
- Model-layer integration: adds an `enable_fusion` switch (off by default) to `LlamaConfig` and introduces fusion paths in modules such as `LlamaMLP`.
- Inference engine extension: provides the `FusedInferEngine` interface supporting three modes: `always_fuse`, `never_fuse`, and `profile` (dynamic scheduling).
- Benchmarking tool: adds `benchmark_fusion_e2e.py` for automated end-to-end latency comparison across fusion strategies.
### Experimental Results (demo131)
Test model: TinyLlama-1.1B-Chat | Hardware: single 天数 (Iluvatar) BI150 GPU
### How to Run
1. Run the benchmark script `examples/benchmark_fusion_e2e.py` (Bash)
2. Invoke fusion from code (Python); see the Usage section above and the sketch that follows
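A rough Python sketch of step 2, reusing the `FusedInferEngine` constructor from the Usage section earlier; the commented-out generation call is a hypothetical API, not something this PR confirms:
```python
# Step 2 sketch: constructor matches the Usage section; generate() and its
# parameters are hypothetical, included only to show where inference happens.
from infinilm.fused_infer_engine import FusedInferEngine

engine = FusedInferEngine("/models/TinyLlama-1.1B-Chat", fusion_mode="profile")
# text = engine.generate("Hello, world", max_new_tokens=32)  # hypothetical
```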
### Compatibility Notes
- All features are opt-in and disabled by default; the existing inference flow is unaffected.
- If fusion fails, execution automatically falls back to the non-fused path, preserving inference correctness.
- To avoid affecting the final end-to-end demo, files such as jiuge/bench have not been modified for now.