fix: fix custom_ar bug for rocm #402
base: main
internal source has been updated, please review the changes!
28073ff to 303a32b
// meta data buffers need to be "uncached" for signal on MI200
meta_ = aiter::allocate_meta_buffer(aiter::meta_size() + comm_buf_threshold_);
buffer_ = torch::empty(
Reason? Tests?
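When the model is loaded on GPU with custom_allreduce enabled, multi-TP runs showed accuracy problems. We traced this to the buffer allocated through torch at initialization, which could be interfered with, so the buffer is now managed with hipMalloc instead. Larger dense models, VL models, and others have all been verified to pass.
A smoke test has also been added.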
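The reply above attributes the fix to owning the buffer directly rather than taking it from a torch allocation. A minimal sketch of that ownership pattern follows. It is not the code from this PR: `OwnedBuffer`, `AllocFn`, `FreeFn`, `host_alloc`, and `host_free` are hypothetical names, and the hooks wrap host `malloc`/`free` so the sketch runs without a GPU. On ROCm the hooks would presumably wrap `hipMalloc`/`hipFree`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <utility>

// Hypothetical allocator hooks (assumption, not part of aiter).
// Host malloc/free stand in for device allocation so this runs anywhere.
using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);

inline void* host_alloc(std::size_t n) { return std::malloc(n); }
inline void host_free(void* p) { std::free(p); }

// Owns exactly one raw allocation for its whole lifetime. Because the
// memory never goes back to a shared pool while the object is alive,
// no other component can be handed the same block.
class OwnedBuffer {
 public:
  explicit OwnedBuffer(std::size_t bytes, AllocFn alloc = host_alloc,
                       FreeFn release = host_free)
      : bytes_(bytes), release_(release) {
    assert(bytes > 0 && alloc != nullptr && release != nullptr);
    ptr_ = alloc(bytes);
    if (ptr_ == nullptr) throw std::runtime_error("allocation failed");
  }

  // Release the allocation exactly once, unless it was moved away.
  ~OwnedBuffer() {
    if (ptr_ != nullptr) release_(ptr_);
  }

  // Copying would create two owners of one block, so it is disabled.
  OwnedBuffer(const OwnedBuffer&) = delete;
  OwnedBuffer& operator=(const OwnedBuffer&) = delete;

  // Moving transfers ownership and leaves the source empty.
  OwnedBuffer(OwnedBuffer&& other) noexcept
      : ptr_(std::exchange(other.ptr_, nullptr)),
        bytes_(std::exchange(other.bytes_, 0)),
        release_(other.release_) {}
  OwnedBuffer& operator=(OwnedBuffer&&) = delete;

  void* data() const { return ptr_; }
  std::size_t size() const { return bytes_; }

 private:
  void* ptr_ = nullptr;
  std::size_t bytes_ = 0;
  FreeFn release_;
};
```

The allocator is passed in as a pair of function pointers so the same wrapper can be exercised on the host in a smoke test and pointed at device allocation functions in production.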
303a32b to 9a05005
9a05005 to fc25079
No description provided.