Conversation

@liaocz (Collaborator) commented on Nov 25, 2025

No description provided.

@liaocz requested a review from LLLLKKKK as a code owner on November 25, 2025 at 11:53
@github-actions bot commented

internal source has been updated, please review the changes!

@liaocz force-pushed the hotfix/rocm_custom_ar branch from 28073ff to 303a32b on November 26, 2025 at 11:13
@github-actions bot commented

internal source has been updated, please review the changes!


// meta data buffers need to be "uncached" for signal on MI200
meta_ = aiter::allocate_meta_buffer(aiter::meta_size() + comm_buf_threshold_);
buffer_ = torch::empty(
A Collaborator commented on this hunk:

What is the reason for this change? Has it been tested?

@liaocz (Collaborator, Author) replied:

A multi-TP accuracy issue showed up when loading the model on GPU with custom_allreduce enabled. It was traced to the buffer allocated via torch at initialization being subject to interference, so the buffer is now allocated and managed with hipMalloc instead. Larger dense models, VL models, and others have all been verified to pass, and a smoke test has been added.
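
For context, here is a minimal sketch of how an uncached, HIP-managed buffer can be allocated and exposed to torch on MI200. This is an illustrative assumption, not the actual aiter::allocate_meta_buffer implementation; the function name, error handling, and tensor wrapping are all hypothetical:

// Hypothetical sketch, not the actual aiter code.
// hipDeviceMallocUncached requests fine-grained, uncached device memory on
// MI200 (gfx90a), so peer GPUs observe signal-flag writes without reading
// stale cached values.
#include <hip/hip_runtime.h>
#include <torch/torch.h>

torch::Tensor allocate_uncached_buffer(size_t size_bytes) {
  void* ptr = nullptr;
  hipError_t err =
      hipExtMallocWithFlags(&ptr, size_bytes, hipDeviceMallocUncached);
  TORCH_CHECK(err == hipSuccess,
              "hipExtMallocWithFlags failed: ", hipGetErrorString(err));
  // Hand ownership to the tensor; hipFree runs when the tensor is released.
  auto deleter = [](void* p) { (void)hipFree(p); };
  return torch::from_blob(
      ptr, {static_cast<int64_t>(size_bytes)}, deleter,
      // ROCm builds of PyTorch expose HIP devices under the kCUDA device type.
      torch::TensorOptions().dtype(torch::kUInt8).device(torch::kCUDA));
}

The key property is that the allocation bypasses torch's caching allocator entirely, so no other tensor can ever be handed the pages that hold the cross-GPU signal flags.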

@liaocz force-pushed the hotfix/rocm_custom_ar branch from 303a32b to 9a05005 on December 1, 2025 at 09:47
@github-actions bot commented Dec 1, 2025

internal source has been updated, please review the changes!

@liaocz force-pushed the hotfix/rocm_custom_ar branch from 9a05005 to fc25079 on December 2, 2025 at 09:05
@github-actions bot commented Dec 2, 2025

internal source has been updated, please review the changes!
