fix: resolve multi-iteration tensor file overwrite and simplify precision checker #104
base: master
Conversation
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <[email protected]>
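As a rough illustration of the counter mechanism this commit describes: a minimal sketch assuming the ResetCounters()/PrecisionCheckEnv names from the message and the tls_ prefix from a later commit; the member layout and NextIndex helper are assumptions, not the actual implementation.

```cpp
// Minimal sketch of the counter mechanism; ResetCounters(), PrecisionCheckEnv
// and the tls_ prefix are names from this PR, the internals are assumed.
#include <cstddef>
#include <string>
#include <unordered_map>

class PrecisionCheckEnv {
 public:
  // Returns the next dump index for a tensor name, used to build unique
  // file names within one training step.
  static size_t NextIndex(const std::string &tensor_name) {
    return tls_tensor_counters_[tensor_name]++;
  }

  // Called at the start of every training step in the gpt2/llama3 examples,
  // so indices restart at each iteration boundary instead of drifting
  // across iterations.
  static void ResetCounters() { tls_tensor_counters_.clear(); }

 private:
  // thread_local storage keeps the counters race-free per worker thread.
  static thread_local std::unordered_map<std::string, size_t>
      tls_tensor_counters_;
};

thread_local std::unordered_map<std::string, size_t>
    PrecisionCheckEnv::tls_tensor_counters_;
```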
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16 precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and 4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <[email protected]>
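The tolerance behavior can be pictured as bucketing values before hashing, so values closer together than the tolerance hash identically. A sketch of the idea only, with std::hash standing in for MD5 and ToleranceHash as a hypothetical helper:

```cpp
// Illustration of the md5_tolerance idea: quantize each value to a
// tolerance-sized bucket, then hash the bucket. std::hash stands in for MD5
// to keep the sketch self-contained; the real implementation may differ.
#include <cmath>
#include <cstdio>
#include <functional>

size_t ToleranceHash(double value, double tolerance) {
  // 4.0003 / 1e-3 -> 4000.3 -> bucket 4000; 4.0004 -> 4000.4 -> bucket 4000.
  long long bucket = std::llround(value / tolerance);
  return std::hash<long long>{}(bucket);
}

int main() {
  std::printf("%zu\n", ToleranceHash(4.0003, 1e-3));  // same bucket...
  std::printf("%zu\n", ToleranceHash(4.0004, 1e-3));  // ...same hash
}
```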
Replace all Chinese comments with English translations in global_module_hook_registry.h for better international accessibility.
Also, in the scripts/run_models_and_profile.bash test script, please add a step at the end that runs compare_loss.py to compare precision (you could add a config option for passing in the log path to compare against; if it isn't provided, skip the loss comparison script by default).
- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
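A hedged sketch of how the new COMPARE_LOG_DIR knob might be used; the environment-variable calling convention and the example path are assumptions:

```bash
# COMPARE_LOG_DIR is the option named in this commit; the env-var calling
# convention and the path are assumptions for illustration.
COMPARE_LOG_DIR=/path/to/reference/logs bash scripts/run_models_and_profile.bash
# When COMPARE_LOG_DIR is unset, the compare_loss.py step is skipped, per the
# reviewer request above.
```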
done
- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
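A sketch of the RegisterHook/HookHandle ownership model this commit introduces; GlobalModuleHookRegistry, HookHandle, and registrars_ are names from this PR, while the bodies below are assumptions for illustration.

```cpp
// Sketch only: returning an owning handle lets callers unregister a hook by
// dropping the handle (RAII) or calling Remove() explicitly.
#include <functional>
#include <iterator>
#include <list>
#include <memory>
#include <mutex>

namespace nn { class Module; }

using ModuleHook = std::function<void(nn::Module *)>;

class HookHandle {
 public:
  explicit HookHandle(std::function<void()> remover)
      : remover_(std::move(remover)) {}
  ~HookHandle() { Remove(); }

  // Unregisters the hook; safe to call more than once.
  void Remove() {
    if (remover_) {
      remover_();
      remover_ = nullptr;
    }
  }

 private:
  std::function<void()> remover_;
};

class GlobalModuleHookRegistry {
 public:
  static GlobalModuleHookRegistry &Instance() {
    static GlobalModuleHookRegistry instance;
    return instance;
  }

  // Destroying the returned handle takes the registrar out of registrars_,
  // so later ApplyHooks calls find nothing to apply.
  std::unique_ptr<HookHandle> RegisterHook(ModuleHook hook) {
    std::lock_guard<std::mutex> lock(mutex_);
    registrars_.push_back(std::move(hook));
    auto it = std::prev(registrars_.end());
    return std::make_unique<HookHandle>([this, it] {
      std::lock_guard<std::mutex> lock(mutex_);
      registrars_.erase(it);
    });
  }

 private:
  std::mutex mutex_;
  std::list<ModuleHook> registrars_;  // std::list keeps erase iterators stable
};
```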
```cpp
void GlobalModuleHookRegistry::ApplyHooks(nn::Module *module) {
  std::lock_guard<std::mutex> lock(mutex_);
  // Early-return if hooks were already applied to this module.
  if (applied_modules_.contains(module)) {
    return;
```
This still only executes once, doesn't it? After the handle calls remove, there is no registrar left in registrars_ below, so nothing runs anyway; there's no need to check applied_modules_ and return early here.
My understanding is that the logic was always meant to execute once: the registrar is registered globally, the hooks get installed on the module at its first forward, and from then on everything runs through the module's own hooks mechanism.
- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.
Changes
Counter Mechanism Fix
- Add ResetCounters() method to reset tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
- Call ResetCounters() at the start of each training step in gpt2/llama3
Precision Checker Refactoring
- Add SaveNpy() function with rank subdirectory support
- Simplify log output format (stats sketched below): [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
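The min/max/mean/NaN/Inf fields in this log line map onto the TensorStats struct from the commit message. A sketch of how such stats could be computed; whether NaN/Inf values are excluded from min/max/mean is an assumption:

```cpp
// Illustrative only: TensorStats is a name from the commit message; the
// field types and the exclusion of NaN/Inf from the running stats are
// assumptions about the actual implementation.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct TensorStats {
  float min = std::numeric_limits<float>::infinity();
  float max = -std::numeric_limits<float>::infinity();
  double mean = 0.0;
  size_t nan_count = 0;
  size_t inf_count = 0;
};

TensorStats ComputeStats(const std::vector<float> &data) {
  TensorStats s;
  double sum = 0.0;
  for (float v : data) {
    if (std::isnan(v)) { ++s.nan_count; continue; }
    if (std::isinf(v)) { ++s.inf_count; continue; }
    s.min = std::min(s.min, v);
    s.max = std::max(s.max, v);
    sum += v;
  }
  const size_t finite = data.size() - s.nan_count - s.inf_count;
  s.mean = finite ? sum / finite : 0.0;
  return s;
}
```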
New Scripts
- scripts/precision_check/precision_compare.py - Offline NPY comparison tool
- scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
- scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script
Documentation
- Update docs/precision_checker_guide.md to reflect current implementation

Usage Example
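A minimal sketch of the intended flow, assuming the PrecisionChecker::Init() API added earlier in this PR; the header path, kNumSteps, and TrainOneStep are placeholders, not real project code:

```cpp
// Hedged sketch: Init() and ResetCounters() are APIs named in this PR;
// everything else below is a placeholder for illustration.
#include "precision_checker.h"  // assumed header location

constexpr int kNumSteps = 10;  // placeholder step count
void TrainOneStep();           // placeholder for the model's training step

int main() {
  // Init() registers the module hooks automatically, so forward/backward
  // outputs get dumped without touching model code.
  PrecisionChecker::Init();
  for (int step = 0; step < kNumSteps; ++step) {
    PrecisionCheckEnv::ResetCounters();  // reset at each iteration boundary
    TrainOneStep();
  }
}
```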
Testing Example
Run verification script:
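A likely invocation, using the script paths from the New Scripts list above; exact flags, if any, are not shown here:

```bash
# Paths from the "New Scripts" section; any additional flags are unknown.
bash scripts/precision_check/run_precision_check_gpt2.sh
# or, for the LLaMA3 variant:
bash scripts/precision_check/run_precision_check_llama3.sh
```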