Introduce all_reduce_hook to support gradient aggregation across replica groups.#7764
zhengchenyu wants to merge 3 commits into deepspeedai:master
Conversation
Introduce all_reduce_hook to support gradient aggregation across replica groups. Signed-off-by: zhengchenyu <zhengchenyu16@163.com>
@zhengchenyu thanks for the PR. Can you provide some clarification for the motivation?
We already provide a form of this functionality in hpZ component of ZeRO++. Have you explored whether hpZ would meet your needs?
My understanding is that replica groups are only relevant for ZeRO stage 3, since the lower stages don't do parameter partitioning. Can you explain how replica groups arise in your workload?
@sfc-gh-truwase Thanks for your review!
Regarding ZeRO++: it cannot solve problem (2). It can solve problem (1), but there is a cost involved, since we must introduce extra …

Regarding MICS: for ZeRO stage 3, these two problems do not exist. For stages 1/2, problem (1) does not arise, but if the optimizer parameters are considered when loading the checkpoint, problem (2) still occurs.
Thanks for sharing more details.
@sfc-gh-truwase Thanks for your reply.

Yes, I agree that existing options like …
Using replica groups offers the following advantages:

- For stage 3, it ensures that parameter gathers during forward and backward occur only within the replica group.
- Checkpointing is performed only on `replica_group_rank=0`, guaranteeing a constant checkpoint world size and avoiding the universal-checkpoint transformations when scaling up or down.

Without a hook, we could all-reduce gradients within the replica group after backward and before `optimizer.step`, but we would have to wait for all buckets to complete and thus could not leverage any concurrency advantage.
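To make the idea concrete, here is a minimal, torch-free sketch of the two building blocks discussed above: partitioning ranks into replica groups and averaging a gradient bucket within one group. The names `build_replica_groups` and `all_reduce_hook` are hypothetical illustrations, not the PR's actual API; a real implementation would call `torch.distributed.all_reduce` on the process group for each bucket as it becomes ready, which is what preserves per-bucket overlap.

```python
def build_replica_groups(world_size: int, replica_group_size: int):
    """Partition ranks into contiguous replica groups, e.g. 8 ranks with
    group size 4 -> [[0, 1, 2, 3], [4, 5, 6, 7]]. (Hypothetical helper.)"""
    assert world_size % replica_group_size == 0
    return [list(range(start, start + replica_group_size))
            for start in range(0, world_size, replica_group_size)]


def all_reduce_hook(grads_by_rank: dict, group: list):
    """Average one gradient bucket across the ranks of a single replica
    group, in place. Stands in for dist.all_reduce(bucket, group=pg)
    followed by division by the group size; firing this per ready bucket
    is what lets communication overlap with the remaining backward pass."""
    n = len(group)
    bucket_len = len(grads_by_rank[group[0]])
    averaged = [sum(grads_by_rank[rank][i] for rank in group) / n
                for i in range(bucket_len)]
    for rank in group:
        grads_by_rank[rank] = list(averaged)
```

For example, with `build_replica_groups(8, 4)` the all-reduce for ranks 0–3 never touches ranks 4–7, mirroring the claim that gathers and reductions stay inside the replica group.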
I know MICS has similar functionality, but it currently only supports ZeRO stage 3. Additionally, I want to use this feature for compatibility with architectures like TorchFT.