
Conversation

@chen2021673 (Contributor) commented Jan 22, 2026

Summary

Fix the precision checker's file accumulation issue during multi-iteration runs (the tensor counter was never reset at iteration boundaries, so dump files kept accumulating under ever-growing indices) and simplify the overall implementation.

Changes

Counter Mechanism Fix

  • Add ResetCounters() method to reset tensor counter at iteration boundaries
  • Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
  • Call ResetCounters() at the start of each training step in gpt2/llama3 (see the counter sketch after this list)
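
A minimal sketch of this counter mechanism, assuming PrecisionCheckEnv is a process-wide singleton; NextTensorIndex and tls_tensor_counter_ are illustrative names (the tls_ prefix follows the convention adopted later in this PR), not necessarily the actual API:

#include <cstdint>

class PrecisionCheckEnv {
public:
    static PrecisionCheckEnv &Instance() {
        static PrecisionCheckEnv env;
        return env;
    }

    // Monotonically increasing index used to tag tensors within one iteration.
    int64_t NextTensorIndex() { return tls_tensor_counter_++; }

    // Called at the start of each training step so numbering restarts at zero
    // and dump files stop accumulating under ever-growing indices.
    void ResetCounters() { tls_tensor_counter_ = 0; }

private:
    // thread_local keeps per-thread numbering race-free without locking.
    static thread_local int64_t tls_tensor_counter_;
};

thread_local int64_t PrecisionCheckEnv::tls_tensor_counter_ = 0;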

Precision Checker Refactoring

  • Remove baseline comparison functionality (use separate script instead)
  • Remove table format output, keep only simple and md5 formats
  • Add SaveNpy() function with rank subdirectory support
  • Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]; a stats sketch follows this list
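
A sketch of the per-tensor statistics behind this log format; the TensorStats field names come from the commit messages below, while ComputeStats and its float-buffer signature are assumptions:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct TensorStats {
    float min = INFINITY;
    float max = -INFINITY;
    float mean = 0.0f;
    int64_t nan_count = 0;
    int64_t inf_count = 0;
};

TensorStats ComputeStats(const std::vector<float> &data) {
    TensorStats s;
    double sum = 0.0;
    int64_t n = 0;
    for (float v : data) {
        // NaN/Inf are counted separately and excluded from min/max/mean.
        if (std::isnan(v)) { ++s.nan_count; continue; }
        if (std::isinf(v)) { ++s.inf_count; continue; }
        s.min = std::min(s.min, v);
        s.max = std::max(s.max, v);
        sum += v;
        ++n;
    }
    if (n > 0) { s.mean = static_cast<float>(sum / n); }
    return s;
}

The log line is then assembled from these fields plus the dtype, shape, and the first six values of the tensor (per the commit message below).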

New Scripts

  • scripts/precision_check/precision_compare.py - Offline NPY comparison tool
  • scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
  • scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

  • Update docs/precision_checker_guide.md to reflect the current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2

Testing Example

Run the verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

chen2021673 and others added 3 commits January 22, 2026 07:49
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker
  from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16
  precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and
  4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <[email protected]>
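
One plausible way to get the md5_tolerance behavior described above is to snap each value to the tolerance grid before hashing; QuantizeForHash is a hypothetical name and the rounding scheme is an assumption, not necessarily what the PR implements:

#include <cmath>
#include <vector>

// Snap values to multiples of `tolerance` before feeding them to MD5, so
// values closer together than the tolerance collapse to the same byte pattern.
std::vector<float> QuantizeForHash(const std::vector<float> &data, float tolerance) {
    std::vector<float> out;
    out.reserve(data.size());
    for (float v : data) {
        out.push_back(tolerance > 0.0f ? std::round(v / tolerance) * tolerance : v);
    }
    return out;
}

With tolerance=1e-3, both 4.0003 and 4.0004 quantize to 4.000 and therefore hash identically, while genuinely divergent values still change the MD5.
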
Replace all Chinese comments with English translations in
global_module_hook_registry.h for better international accessibility.
@kilinchange (Collaborator) commented:

Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not set, skip the loss comparison script by default.)

- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@chen2021673 (Contributor, Author) replied:

> Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not set, skip the loss comparison script by default.)

done

chen2021673 and others added 2 commits January 29, 2026 09:10
- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
void GlobalModuleHookRegistry::ApplyHooks(nn::Module *module) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (applied_modules_.contains(module)) {
        return;
    }
    ...
@kilinchange (Collaborator) commented on this snippet:

This way it still only runs once, doesn't it? The reason should be that after the handle calls remove, there is no registrar left in registrars_ below, so nothing runs anyway; the early return on applied_modules here shouldn't be needed.

@chen2021673 (Contributor, Author) replied:

My understanding is that the logic was always meant to run once: the registrar is registered globally, the hooks are installed onto the module on the first forward, and from then on the module's own hooks mechanism takes over.
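
For reference, a self-contained sketch of the flow described in this exchange. The names RegisterHook, HookHandle, ApplyHooks, mutex_, registrars_, and applied_modules_ come from the snippet and commit messages above; the id bookkeeping and locking granularity are guesses:

#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_set>
#include <utility>
#include <vector>

namespace nn { class Module; }  // stand-in for the project's module type

class GlobalModuleHookRegistry {
public:
    using Registrar = std::function<void(nn::Module *)>;

    // RAII handle: destroying it deregisters the registrar, which is the
    // removal path this review discusses.
    class HookHandle {
    public:
        HookHandle(GlobalModuleHookRegistry *owner, std::size_t id) : owner_(owner), id_(id) {}
        ~HookHandle() { owner_->Remove(id_); }

    private:
        GlobalModuleHookRegistry *owner_;
        std::size_t id_;
    };

    static GlobalModuleHookRegistry &Instance() {
        static GlobalModuleHookRegistry registry;
        return registry;
    }

    std::unique_ptr<HookHandle> RegisterHook(Registrar registrar) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t id = next_id_++;
        registrars_.emplace_back(id, std::move(registrar));
        return std::make_unique<HookHandle>(this, id);
    }

    // Called from the module's forward path: each live registrar runs once
    // per module, after which the module's own hooks mechanism takes over.
    void ApplyHooks(nn::Module *module) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (applied_modules_.contains(module)) {
            return;
        }
        for (auto &entry : registrars_) {
            entry.second(module);
        }
        applied_modules_.insert(module);
    }

private:
    void Remove(std::size_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::erase_if(registrars_, [id](const auto &entry) { return entry.first == id; });
    }

    std::mutex mutex_;
    std::size_t next_id_ = 0;
    std::vector<std::pair<std::size_t, Registrar>> registrars_;
    std::unordered_set<nn::Module *> applied_modules_;
};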

- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <[email protected]>