@hootandy321

PR Summary: [demo131] Introduce operator fusion optimization and an end-to-end benchmarking tool

Core Changes

This PR introduces optional operator fusion for LLaMA inference, primarily to support demo131's performance demonstration and comparative benchmarking.

  1. Operator fusion mechanism: a new FusionContext provides runtime control, supporting dynamic fusion of operators such as swiglu and add_rms_norm with safe fallback.

  2. Model-layer integration: a new enable_fusion switch in LlamaConfig (off by default), with fused paths introduced in modules such as LlamaMLP.

  3. Inference engine extension: a new FusedInferEngine interface supporting three modes: always_fuse, never_fuse, and profile (dynamic scheduling); see the sketch after this list.

  4. Benchmarking tool: a new benchmark_fusion_e2e.py that automates end-to-end latency comparison across fusion strategies.
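A minimal sketch of how the three modes might map to a per-call fusion decision. Only the mode names come from this PR; the sequence-length heuristic and the helper name below are illustrative assumptions:

```python
# Illustrative only: the threshold heuristic and this helper name are
# assumptions; only the three mode names come from the PR.
def should_fuse(mode: str, seq_len: int, threshold: int = 512) -> bool:
    """Decide whether one call takes the fused-kernel path."""
    if mode == "always_fuse":
        return True
    if mode == "never_fuse":
        return False
    if mode == "profile":
        # Dynamic scheduling: long (prefill-heavy) sequences tend to
        # benefit most from fusion, per the results below.
        return seq_len >= threshold
    raise ValueError(f"unknown fusion mode: {mode}")
```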

Experimental Results (demo131)

Test model: TinyLlama-1.1B-Chat | Hardware: single Iluvatar BI150 GPU

| Strategy | Total time | Relative speedup | Conclusion |
| --- | --- | --- | --- |
| smart_schedule | 3201 ms | 1.06× | Biggest advantage on long sequences (1.09×) |
| never_fuse | 3352 ms | 1.01× | Slightly better for short, decode-heavy sequences |
| always_fuse | 3392 ms | 1.00× | A static policy cannot fit every scenario |
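The speedup column is computed against always_fuse as the 1.00× baseline (e.g. 3392 ms / 3201 ms ≈ 1.06×):

```python
# Sanity-check the speedup column; always_fuse is the 1.00x baseline.
timings_ms = {"smart_schedule": 3201, "never_fuse": 3352, "always_fuse": 3392}
baseline = timings_ms["always_fuse"]
for strategy, t in timings_ms.items():
    print(f"{strategy}: {baseline / t:.2f}x")  # 1.06x, 1.01x, 1.00x
```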

How to Run

1. Run the benchmark

```bash
python3 examples/benchmark_fusion_e2e.py \
  --iluvatar \
  --model_path /path/to/TinyLlama-1.1B-Chat-v1.0 \
  --runs 2
```

2. Calling it from code

```python
from infinilm.fused_infer_engine import FusedInferEngine

# Recommended demo mode: schedule fusion automatically based on profiling
engine = FusedInferEngine(model_path, fusion_mode="profile")
```
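For a quick A/B comparison outside the benchmark script, the same prompt can be timed under two modes. Note that engine.generate, prompt, and model_path are assumed names here; only the constructor above appears in this PR:

```python
import time

# Hypothetical A/B timing; generate() is an assumed API surface.
for mode in ("profile", "never_fuse"):
    engine = FusedInferEngine(model_path, fusion_mode=mode)
    start = time.perf_counter()
    engine.generate(prompt, max_new_tokens=128)
    print(f"{mode}: {time.perf_counter() - start:.3f} s")
```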

Compatibility Notes

  • All features are opt-in and disabled by default, so the existing inference flow is unaffected.

  • If fusion fails, execution automatically falls back to the non-fused path, preserving inference correctness (see the sketch below).

  • To avoid disturbing the final end-to-end demo, files such as jiuge/bench are left unchanged for now.
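A minimal sketch of the fallback pattern, assuming a hypothetical fused_swiglu_kernel (the real fused kernels live in the C++ layer; these names are illustrative):

```python
import torch
import torch.nn.functional as F

def swiglu(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    """Try the fused kernel; fall back to separate ops on any failure."""
    try:
        return fused_swiglu_kernel(gate, up)  # hypothetical fused kernel
    except (RuntimeError, NameError):
        return F.silu(gate) * up  # safe, always-correct non-fused path
```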

PanZezhong1725 and others added 13 commits January 22, 2026 08:16
This commit adds fusion optimization support for LLaMA models, enabling
dynamic scheduling of fused operators (swiglu, add_rms_norm) with runtime
control via FusionContext.

## Core Features

### 1. Fusion Context System
- Added FusionContext class for runtime fusion control
- Supports per-operation fusion decisions (set/get/has/clear)
- Thread-safe configuration management (sketched below)
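A rough Python-level sketch of the interface described above; the real implementation is C++ (csrc/fusion/fusion_context.{cpp,hpp}), and the exact signatures here are assumptions:

```python
import threading

class FusionContext:
    """Sketch of the set/get/has/clear, thread-safe interface."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._decisions: dict[str, bool] = {}

    def set(self, op: str, fuse: bool) -> None:
        with self._lock:
            self._decisions[op] = fuse

    def get(self, op: str, default: bool = False) -> bool:
        with self._lock:
            return self._decisions.get(op, default)

    def has(self, op: str) -> bool:
        with self._lock:
            return op in self._decisions

    def clear(self) -> None:
        with self._lock:
            self._decisions.clear()
```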

### 2. C++ Integration
- Modified LlamaConfig: added `enable_fusion` flag (default: false)
- Modified LlamaMLP: conditional swiglu fusion with fallback (sketched after this list)
- Modified LlamaDecoderLayer: prepared for add_rms_norm fusion
- All changes are backward compatible and opt-in
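In Python pseudocode, the conditional LlamaMLP path the commit describes might look like the following, reusing the swiglu fallback helper sketched earlier; the projection names follow the standard LLaMA MLP and are assumptions here:

```python
import torch
import torch.nn.functional as F

def llama_mlp_forward(mlp, ctx, x: torch.Tensor) -> torch.Tensor:
    """mlp holds gate/up/down projections; ctx is a FusionContext."""
    gate, up = mlp.gate_proj(x), mlp.up_proj(x)
    if mlp.config.enable_fusion and ctx.get("swiglu"):
        hidden = swiglu(gate, up)   # fused path, falls back on error
    else:
        hidden = F.silu(gate) * up  # standard separate-kernel SwiGLU
    return mlp.down_proj(hidden)
```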

### 3. Python API
- Added fusion_utils.py: FusionManager and pattern creators
- Added fused_infer_engine.py: FusedInferEngine with 3 fusion modes
  - always_fuse: always use fused kernels
  - never_fuse: always use separate kernels
  - profile: smart scheduling based on heuristics
- Updated __init__.py with conditional fusion imports

### 4. Testing and Benchmarking
- Added benchmark_fusion_e2e.py: end-to-end fusion performance testing
- Added test_llama_fusion.py: fusion unit tests
- Scripts verify correctness and performance improvements (a correctness check is sketched below)
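A sketch of the kind of correctness check such a test might perform, reusing the swiglu helper sketched earlier; the shapes and tolerances are illustrative placeholders, not taken from test_llama_fusion.py:

```python
import torch
import torch.nn.functional as F

def test_fused_swiglu_matches_reference():
    # Compare the (possibly fused) swiglu against the separate-op reference.
    gate = torch.randn(4, 128, 5632)
    up = torch.randn(4, 128, 5632)
    ref = F.silu(gate) * up
    out = swiglu(gate, up)  # helper sketched earlier
    torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```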

## Compatibility

- **100% backward compatible**: Fusion disabled by default
- **No API changes**: Existing code works without modifications
- **Opt-in**: Enable via config.enable_fusion = True
- **Safe fallback**: Automatic fallback to non-fused path on errors

## Files Added

### C++
- csrc/fusion/fusion_context.{cpp,hpp}: Fusion context implementation
- csrc/pybind11/fusion.hpp: Python bindings

### Python
- python/infinilm/fusion_utils.py
- python/infinilm/fused_infer_engine.py
- examples/benchmark_fusion_e2e.py
- test_llama_fusion.py

## Files Modified

### C++
- csrc/models/llama/llama_config.hpp: +enable_fusion flag
- csrc/models/llama/llama_mlp.{cpp,hpp}: +fusion logic
- csrc/models/llama/llama_decoder_layer.{cpp,hpp}: +fusion support
- csrc/pybind11/bindings.cc: +fusion bindings

### Python
- python/infinilm/__init__.py: conditional fusion exports
- python/infinilm/models/llama/modeling_llama.py: try-import fusion

## Usage

```python
# Option 1: Use FusedInferEngine (recommended)
from infinilm.fused_infer_engine import FusedInferEngine
engine = FusedInferEngine(model_path, fusion_mode="profile")  # smart scheduling

# Option 2: Enable fusion in existing code
from infinilm.auto_config import AutoConfig
config = AutoConfig.from_pretrained(model_path)
config.enable_fusion = True  # Enable fusion
```

Co-Authored-By: liuxingyu <[email protected]>
@hootandy321 requested a review from a team on January 30, 2026 at 03:53