
[BUG] Fix: Fix gradient norm calculation and dynamic shape blocking in PP+ZeRO1 collective communication#7847

Open
Thinksky5124 wants to merge 1 commit into deepspeedai:master from Thinksky5124:master

Conversation

@Thinksky5124

Describe the bug

This commit fixes gradient-norm calculation bugs that appear when DeepSpeed Pipeline Parallelism (PP) is used together with ZeRO Stage 1 (zero1), covering the following aspects:

  1. PipelineEngine Buffer Type Consistency under Dynamic Shapes
    In deepspeed/runtime/pipe/engine.py, the activation buffer previously did not enforce a dtype conversion, which could leave buffers with inconsistent types and cause downstream calculation errors. The return value is now explicitly cast to the target dtype, guaranteeing type consistency.

  2. ZeRO Stage 1/2 Gradient Normalization Logic Correction
    In deepspeed/runtime/zero/stage_1_and_2.py, on both the CPU-offload and regular paths, the previous gradient normalization performed redundant communication and normalized incorrectly:

complete_grad_norm_calculation_for_cpu_offload now only computes the local squared L2 norm without cross-rank communication, avoiding redundant normalization and double counting.

get_grad_norm_direct only supports L2 norm, directly accumulates the local gradient squared sum, and avoids double counting for pipeline parallel parameters.

scaled_global_norm unifies the normalization process: first accumulates the squared sum of gradients for all groups locally, then uses all_reduce to aggregate, and finally takes the square root to obtain the global L2 norm, ensuring consistency with ZeRO design.

  3. Code Redundancy and Exception Handling Optimization
    Removed the inf constant and the unnecessary norm_type branches; only the L2 norm is now supported, which simplifies the logic. Stricter skipping of None gradients and pipeline-replicated parameters improves robustness.

After these fixes, gradient normalization in PP+ZeRO1 scenarios is accurate: double counting, type inconsistency, and redundant communication are avoided, improving training stability and performance. Reviewers are advised to focus on the correctness and compatibility of the gradient normalization code to ensure consistent behavior across the different parallel/offload scenarios.
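The unified flow described above (accumulate local squared sums, all-reduce once, take a single square root) can be sketched as follows. This is a minimal, framework-free simulation: `per_rank_grads` and the summation standing in for the all-reduce are illustrative assumptions, not the actual DeepSpeed/torch.distributed calls.

```python
import math

def local_sq_norm(grads):
    """Local squared L2 norm over this rank's gradient shards; no communication."""
    return sum(g * g for shard in grads for g in shard)

def scaled_global_norm(per_rank_grads):
    """Each rank accumulates its local squared sum, the sums are all-reduced
    (here: a plain Python sum stands in for dist.all_reduce with SUM), and the
    square root is taken exactly once to obtain the global L2 norm."""
    local_sums = [local_sq_norm(grads) for grads in per_rank_grads]
    total_sq = sum(local_sums)  # stand-in for the single all_reduce
    return math.sqrt(total_sq)

# Two ranks holding disjoint gradient shards (ZeRO-style partitioning).
rank0 = [[3.0], [0.0]]
rank1 = [[4.0]]
print(scaled_global_norm([rank0, rank1]))  # 5.0
```

Taking the square root only after the reduction is what avoids the double normalization the old code performed when each rank computed a full norm and the results were reduced again.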

DeepSpeed Config

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 1
    }
}
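To reproduce the setup, the config above would be passed to deepspeed.initialize; the snippet below only sanity-checks the JSON shown and deliberately avoids importing DeepSpeed.

```python
import json

# The exact config from the bug report.
ds_config = json.loads("""
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 1
    }
}
""")

# The reported bug requires ZeRO stage 1 (combined with pipeline parallelism).
assert ds_config["zero_optimization"]["stage"] == 1
print(ds_config["gradient_accumulation_steps"])  # 4
```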


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99697687fc


Comment on lines 1997 to +1998
  if self.has_moe_layers:
-     self._average_expert_grad_norms(norm_groups)
+     self._average_expert_grad_norms(local_total_sq_norm)


P1 Badge Preserve per-group norms when averaging MoE expert grads

When has_moe_layers is true, _average_expert_grad_norms expects a per-parameter-group collection (it iterates by group index and checks self.is_moe_param_group[i]), but this call now passes a single accumulated tensor. In runs with multiple optimizer groups, that means MoE scaling is applied to at most index 0 (or to the already-mixed total), so expert and non-expert contributions are mis-scaled before clipping/overflow logic, producing incorrect global grad norms.
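The mismatch can be illustrated with a stripped-down stand-in for _average_expert_grad_norms (the name, the scaling factor, and the averaging itself are simplified assumptions here): because it indexes its argument per parameter group, passing a single accumulated scalar loses the per-group structure the function relies on.

```python
def average_expert_grad_norms(norm_groups, is_moe_param_group, ep_world_size=2):
    """Simplified stand-in: rescale each MoE group's norm by the
    expert-parallel world size, leaving non-MoE groups untouched."""
    out = list(norm_groups)
    for i, norm in enumerate(norm_groups):
        if is_moe_param_group[i]:
            out[i] = norm / ep_world_size  # stand-in for the EP average
    return out

is_moe = [False, True]

# Per-group norms: only the MoE group (index 1) is rescaled.
print(average_expert_grad_norms([3.0, 4.0], is_moe))  # [3.0, 2.0]

# A single accumulated total carries no group index, so per-group logic
# applied to it rescales the already-mixed sum (or nothing) incorrectly.
```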



local_total_sq_norm = local_total_sq_norm.to(torch.cuda.current_device())


P1 Badge Avoid hard-coding CUDA device for norm reduction

This forces scaled_global_norm onto torch.cuda.current_device() even though the optimizer code is written against DeepSpeed's accelerator abstraction. On non-CUDA backends (or CPU-only execution), this line raises before all_reduce, so gradient norm computation and optimizer step fail outright; the tensor should stay on self.device or use get_accelerator().current_device_name().
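A portable pattern looks like the sketch below. get_accelerator().current_device_name() is DeepSpeed's actual accelerator abstraction, but the import guard and CPU fallback are assumptions added so the snippet runs without DeepSpeed or CUDA installed.

```python
def reduction_device_name():
    """Pick the device for the norm all-reduce via DeepSpeed's accelerator
    abstraction when available, falling back to CPU otherwise."""
    try:
        from deepspeed.accelerator import get_accelerator
        return get_accelerator().current_device_name()
    except Exception:
        # CPU-only fallback for this sketch; the point is to avoid
        # hard-coding torch.cuda.current_device().
        return "cpu"

# local_total_sq_norm = local_total_sq_norm.to(reduction_device_name())
print(reduction_device_name())
```

Keeping the tensor on self.device, as the review suggests, is equally valid; either choice removes the hard CUDA dependency.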

