
Conversation

@chen2021673 (Contributor) commented Jan 22, 2026

Summary

Fix the precision checker's file accumulation issue during multi-iteration runs (the tensor counter was never reset at iteration boundaries, so dump files kept accumulating under ever-growing indices) and simplify the overall implementation.

Changes

Counter Mechanism Fix

  • Add ResetCounters() method to reset tensor counter at iteration boundaries
  • Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
  • Call ResetCounters() at the start of each training step in gpt2/llama3 (see the counter sketch after this list)
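
A minimal sketch of this counter mechanism, assuming PrecisionCheckEnv is a process-wide singleton; NextTensorIndex and tls_tensor_counter_ are illustrative names (the tls_ prefix follows the convention adopted later in this PR), not necessarily the actual API:

#include <cstdint>

class PrecisionCheckEnv {
public:
    static PrecisionCheckEnv &Instance() {
        static PrecisionCheckEnv env;
        return env;
    }

    // Monotonically increasing index used to tag tensors within one iteration.
    int64_t NextTensorIndex() { return tls_tensor_counter_++; }

    // Called at the start of each training step so numbering restarts at zero
    // and dump files stop accumulating under ever-growing indices.
    void ResetCounters() { tls_tensor_counter_ = 0; }

private:
    // thread_local keeps per-thread numbering race-free without locking.
    static thread_local int64_t tls_tensor_counter_;
};

thread_local int64_t PrecisionCheckEnv::tls_tensor_counter_ = 0;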

Precision Checker Refactoring

  • Remove baseline comparison functionality (use separate script instead)
  • Remove table format output, keep only simple and md5 formats
  • Add SaveNpy() function with rank subdirectory support
  • Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]; a stats sketch follows this list
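
A sketch of the per-tensor statistics behind this log format; the TensorStats field names come from the commit messages below, while ComputeStats and its float-buffer signature are assumptions:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct TensorStats {
    float min = INFINITY;
    float max = -INFINITY;
    float mean = 0.0f;
    int64_t nan_count = 0;
    int64_t inf_count = 0;
};

TensorStats ComputeStats(const std::vector<float> &data) {
    TensorStats s;
    double sum = 0.0;
    int64_t n = 0;
    for (float v : data) {
        // NaN/Inf are counted separately and excluded from min/max/mean.
        if (std::isnan(v)) { ++s.nan_count; continue; }
        if (std::isinf(v)) { ++s.inf_count; continue; }
        s.min = std::min(s.min, v);
        s.max = std::max(s.max, v);
        sum += v;
        ++n;
    }
    if (n > 0) { s.mean = static_cast<float>(sum / n); }
    return s;
}

The log line is then assembled from these fields plus the dtype, shape, and the first six values of the tensor (per the commit message below).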

New Scripts

  • scripts/precision_check/precision_compare.py - Offline NPY comparison tool
  • scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
  • scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

  • Update docs/precision_checker_guide.md to reflect the current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2

Testing Example

Run the verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

chen2021673 and others added 3 commits January 22, 2026 07:49
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker
  from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16
  precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and
  4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <[email protected]>
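
One plausible way to get the md5_tolerance behavior described above is to snap each value to the tolerance grid before hashing; QuantizeForHash is a hypothetical name and the rounding scheme is an assumption, not necessarily what the PR implements:

#include <cmath>
#include <vector>

// Snap values to multiples of `tolerance` before feeding them to MD5, so
// values closer together than the tolerance collapse to the same byte pattern.
std::vector<float> QuantizeForHash(const std::vector<float> &data, float tolerance) {
    std::vector<float> out;
    out.reserve(data.size());
    for (float v : data) {
        out.push_back(tolerance > 0.0f ? std::round(v / tolerance) * tolerance : v);
    }
    return out;
}

With tolerance=1e-3, both 4.0003 and 4.0004 quantize to 4.000 and therefore hash identically, while genuinely divergent values still change the MD5.
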
Replace all Chinese comments with English translations in
global_module_hook_registry.h for better international accessibility.
@kilinchange (Collaborator) commented:

Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not set, skip the loss comparison script by default.)

- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@chen2021673 (Contributor, Author) replied:

> Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not set, skip the loss comparison script by default.)

done

chen2021673 and others added 2 commits January 29, 2026 09:10
- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
void GlobalModuleHookRegistry::ApplyHooks(nn::Module *module) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (applied_modules_.contains(module)) {
        return;
    }
    ...
@kilinchange (Collaborator) commented on this snippet:

This way it still only runs once, doesn't it? The reason should be that after the handle calls remove, there is no registrar left in registrars_ below, so nothing runs anyway; the early return on applied_modules here shouldn't be needed.

@chen2021673 (Contributor, Author) replied:

My understanding is that the logic was always meant to run once: the registrar is registered globally, the hooks are installed onto the module on the first forward, and from then on the module's own hooks mechanism takes over.
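
For reference, a self-contained sketch of the flow described in this exchange. The names RegisterHook, HookHandle, ApplyHooks, mutex_, registrars_, and applied_modules_ come from the snippet and commit messages above; the id bookkeeping and locking granularity are guesses:

#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_set>
#include <utility>
#include <vector>

namespace nn { class Module; }  // stand-in for the project's module type

class GlobalModuleHookRegistry {
public:
    using Registrar = std::function<void(nn::Module *)>;

    // RAII handle: destroying it deregisters the registrar, which is the
    // removal path this review discusses.
    class HookHandle {
    public:
        HookHandle(GlobalModuleHookRegistry *owner, std::size_t id) : owner_(owner), id_(id) {}
        ~HookHandle() { owner_->Remove(id_); }

    private:
        GlobalModuleHookRegistry *owner_;
        std::size_t id_;
    };

    static GlobalModuleHookRegistry &Instance() {
        static GlobalModuleHookRegistry registry;
        return registry;
    }

    std::unique_ptr<HookHandle> RegisterHook(Registrar registrar) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t id = next_id_++;
        registrars_.emplace_back(id, std::move(registrar));
        return std::make_unique<HookHandle>(this, id);
    }

    // Called from the module's forward path: each live registrar runs once
    // per module, after which the module's own hooks mechanism takes over.
    void ApplyHooks(nn::Module *module) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (applied_modules_.contains(module)) {
            return;
        }
        for (auto &entry : registrars_) {
            entry.second(module);
        }
        applied_modules_.insert(module);
    }

private:
    void Remove(std::size_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::erase_if(registrars_, [id](const auto &entry) { return entry.first == id; });
    }

    std::mutex mutex_;
    std::size_t next_id_ = 0;
    std::vector<std::pair<std::size_t, Registrar>> registrars_;
    std::unordered_set<nn::Module *> applied_modules_;
};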

- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <[email protected]>