
Add full test suite workflow#7795

Merged
tohtana merged 2 commits into deepspeedai:master from tohtana:tohtana/add_full_test_workflow_v2
Jan 20, 2026

Conversation


@tohtana tohtana commented Jan 18, 2026

The full unit test workflow has been disabled for a while. This PR migrates the full test suite to our AWS test infrastructure.
To make the tests pass, we need to merge these PRs first:

  • deepspeedai#7786
  • deepspeedai#7788
  • deepspeedai#7789
  • deepspeedai#7790
  • deepspeedai#7793
  • deepspeedai#7794
In addition to merging those PRs, this PR makes the following changes in the full test workflow and test harness:

  • Ignore flags for some known issues:
    • nvme: Requires an actual NVMe device; our CI currently doesn't have NVMe storage configured.
    • GDS: Requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers; CI instances don't have this configured.
    • Zenflow:
      1. Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
      2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, whose CUDA graph capture checks fail in forked processes (the --forked flag); we can simply move it to the sequential tests.
  • /mnt/aio mount for async I/O tests
  • CUTLASS installation for Evoformer tests
  • Add DS_DISABLE_REUSE_DIST_ENV to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.
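For illustration, the pieces listed above could be combined roughly like the following GitHub Actions sketch. The workflow name, runner labels, cron cadence, and test paths are assumptions, not the merged file; only the environment variable, the idea of ignore flags, and the manual/scheduled triggers come from the PR description:

```yaml
# Hypothetical sketch of the full-test workflow described in this PR.
# Runner labels, paths, and the schedule are assumptions.
name: nv-full-tests
on:
  workflow_dispatch: {}        # manual runs
  schedule:
    - cron: "0 6 * * 0"        # assumed weekly cadence

jobs:
  full-tests:
    runs-on: [self-hosted, aws]          # assumed label for the AWS runners
    env:
      DS_DISABLE_REUSE_DIST_ENV: "1"     # prevent worker cleanup hangs
    steps:
      - uses: actions/checkout@v4
      - name: Run parallel unit tests
        run: |
          # Ignore known issues: nvme (no NVMe device in CI), GDS (no
          # Magnum IO drivers), ZenFlow stage-3 bugs. Paths are assumed.
          pytest -n 8 tests/unit \
            --ignore=tests/unit/ops/aio \
            --ignore=tests/unit/ops/gds
```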

Add a new workflow file for running the full DeepSpeed unit test suite
on AWS L40S runners. This includes:
- CUTLASS installation for Evoformer tests
- Full dependencies (transformers, pytest-timeout, etc.)
- DS_DISABLE_REUSE_DIST_ENV to prevent worker cleanup hangs
- /mnt/aio mount for async I/O tests
- Parallel tests (-n 8) and sequential tests
- Ignore flags for known issues (nvme, GDS, zenflow, etc.)

This workflow is separate from aws-torch-latest.yml, which runs only V1 tests.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Allow disabling reuse_dist_env via an environment variable. This is useful
for CI full test runs where reusing the distributed environment can cause
pool worker cleanup to hang after tests complete.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
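As a rough sketch of what the harness-side change might look like (the helper name, default, and accepted truthy values are assumptions; only the DS_DISABLE_REUSE_DIST_ENV variable name comes from this PR):

```python
import os

def reuse_dist_env_enabled(default: bool = True) -> bool:
    """Return whether the test harness should reuse the distributed
    environment between tests.

    Hypothetical helper: DS_DISABLE_REUSE_DIST_ENV is the variable this
    PR introduces; the helper name and truthy values are assumptions.
    """
    # Any truthy value opts out of reuse, avoiding pool worker
    # cleanup hangs after the test session completes.
    if os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "").lower() in ("1", "true", "yes"):
        return False
    return default
```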
@PKUWZP PKUWZP self-requested a review January 20, 2026 05:24
@tohtana tohtana merged commit 5aa2d17 into deepspeedai:master Jan 20, 2026
11 checks passed
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026