Add full test suite workflow by tohtana · Pull Request #7795 · deepspeedai/DeepSpeed

tohtana · 2026-01-18T23:18:46Z

The full unit test workflow has been disabled for a while. This PR migrates the full test suite to our AWS test infrastructure.
To make the tests pass, we need to merge these PRs:

- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to merging those PRs, this PR makes the following changes to the full test workflow and test harness:

- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: Requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which performs CUDA graph capture checks that fail in forked processes (the --forked flag; we can just move this test to the sequential tests).
- /mnt/aio mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add DS_DISABLE_REUSE_DIST_ENV to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

Add a new workflow file for running the full DeepSpeed unit test suite on AWS L40S runners. This includes:
- CUTLASS installation for Evoformer tests
- Full dependencies (transformers, pytest-timeout, etc.)
- DS_DISABLE_REUSE_DIST_ENV to prevent worker cleanup hangs
- /mnt/aio mount for async I/O tests
- Parallel tests (-n 8) and sequential tests
- Ignore flags for known issues (nvme, GDS, zenflow, etc.)

This workflow is separate from aws-torch-latest.yml, which runs only V1 tests.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
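To make the split between parallel and sequential runs concrete, here is a minimal sketch of how the two pytest invocations could be assembled. It is not the workflow's actual script: `build_pytest_args` is a hypothetical helper, the paths are placeholders rather than the real test directories, and applying `--forked` (from pytest-forked) to both runs is an assumption. Only `-n 8` (pytest-xdist) and the use of `--ignore` flags come from the commit message above.

```python
from typing import List, Sequence


def build_pytest_args(test_root: str,
                      ignored: Sequence[str] = (),
                      workers: int = 0,
                      forked: bool = True) -> List[str]:
    """Build a pytest command line (hypothetical helper, not DeepSpeed code)."""
    args = ["pytest"]
    if forked:
        # --forked is provided by the pytest-forked plugin.
        args.append("--forked")
    if workers > 0:
        # -n is provided by pytest-xdist; 0 means run in a single process.
        args += ["-n", str(workers)]
    # pytest's --ignore skips collection of the given file or directory.
    args += [f"--ignore={path}" for path in ignored]
    args.append(test_root)
    return args


# Placeholder paths: the real locations of the skipped tests are not shown here.
KNOWN_ISSUES = ["path/to/nvme_tests", "path/to/gds_tests", "path/to/zenflow_tests"]

parallel_cmd = build_pytest_args("tests/unit", KNOWN_ISSUES, workers=8)
sequential_cmd = build_pytest_args("tests/unit", KNOWN_ISSUES, workers=0)

print(" ".join(parallel_cmd))
print(" ".join(sequential_cmd))
```

Keeping the ignore list in one place means the parallel and sequential runs skip the same known-broken tests, so a fix only needs to be reflected once.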

Allow disabling reuse_dist_env via environment variable. This is useful for CI full test runs where reusing the distributed environment can cause pool worker cleanup to hang after tests complete.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
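One way such an override could look is sketched below. This is an illustration, not the code added to the test harness: `reuse_dist_env_enabled` is a hypothetical helper, and the set of values treated as "true" is an assumption. Only the variable name `DS_DISABLE_REUSE_DIST_ENV` and the `reuse_dist_env` setting come from the commit message above.

```python
import os
from typing import Mapping, Optional

# Assumption: which strings count as "set". The real harness may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def reuse_dist_env_enabled(requested: bool,
                           env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether a test may reuse the distributed environment.

    `requested` is the test's own reuse_dist_env setting; the environment
    variable can only turn reuse off, never force it on.
    """
    env = os.environ if env is None else env
    value = env.get("DS_DISABLE_REUSE_DIST_ENV", "").strip().lower()
    disabled = value in _TRUTHY
    return requested and not disabled


# A test that asks for reuse keeps it unless CI sets the variable.
print(reuse_dist_env_enabled(True, {}))
print(reuse_dist_env_enabled(True, {"DS_DISABLE_REUSE_DIST_ENV": "1"}))
```

Making the variable a one-way switch keeps local runs unchanged while letting the CI job opt out globally without editing individual tests.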

The full unit test workflow has been disabled for a while. This PR migrates the full test suite to our AWS test infrastructure. To make the tests pass, we need to merge these PRs:
- deepspeedai#7786
- deepspeedai#7788
- deepspeedai#7789
- deepspeedai#7790
- deepspeedai#7793
- deepspeedai#7794

In addition to merging those PRs, this PR makes the following changes to the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: Requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which performs CUDA graph capture checks that fail in forked processes (the --forked flag; we can just move this test to the sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>

tohtana added 2 commits January 18, 2026 15:01

tohtana requested review from loadams and tjruwase as code owners January 18, 2026 23:18

tohtana mentioned this pull request Jan 18, 2026: Add workflow to run full tests #7783 (Closed)

PKUWZP self-requested a review January 20, 2026 05:24

PKUWZP approved these changes Jan 20, 2026


tohtana merged commit 5aa2d17 into deepspeedai:master Jan 20, 2026
11 checks passed

tohtana merged 2 commits into deepspeedai:master from tohtana:tohtana/add_full_test_workflow_v2

2 participants
