Add full test suite workflow #7795
Merged
tohtana merged 2 commits into deepspeedai:master (Jan 20, 2026)
Add a new workflow file for running the full DeepSpeed unit test suite on AWS L40S runners. This includes:

- CUTLASS installation for Evoformer tests
- Full dependencies (transformers, pytest-timeout, etc.)
- DS_DISABLE_REUSE_DIST_ENV to prevent worker cleanup hangs
- /mnt/aio mount for async I/O tests
- Parallel tests (-n 8) and sequential tests
- Ignore flags for known issues (nvme, GDS, zenflow, etc.)

This workflow is separate from aws-torch-latest.yml, which runs only V1 tests.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
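As a rough illustration of the parallel/sequential split and the ignore flags described above, the two pytest invocations could be assembled as in the sketch below. This is not the workflow's actual content: `build_pytest_commands` is a hypothetical helper, the ignored paths are placeholders, and selecting sequential tests with `-m sequential` is an assumption. Only `-n 8`, `--forked`, and the use of ignore flags are mentioned in this PR.

```python
import shlex


def build_pytest_commands(test_root, ignored_paths, workers=8):
    """Return (parallel_cmd, sequential_cmd) as argv lists.

    Hypothetical helper: the real workflow may structure its commands differently.
    """
    # One --ignore flag per path that should be skipped entirely.
    ignore_flags = [f"--ignore={path}" for path in ignored_paths]
    base = ["pytest", "--forked", *ignore_flags]

    # Parallel run: pytest-xdist spreads the tests over `workers` processes.
    parallel_cmd = [*base, "-n", str(workers), test_root]

    # Sequential run: assumes such tests carry a `sequential` marker.
    sequential_cmd = [*base, "-m", "sequential", test_root]
    return parallel_cmd, sequential_cmd


if __name__ == "__main__":
    # Placeholder paths; the PR does not list the exact ignored files here.
    parallel, sequential = build_pytest_commands(
        "unit/", ["unit/placeholder_nvme", "unit/placeholder_gds"]
    )
    print(shlex.join(parallel))
    print(shlex.join(sequential))
```

Building the commands as argv lists keeps quoting problems out of the picture if they are later passed to `subprocess.run`.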
Allow disabling reuse_dist_env via environment variable. This is useful for CI full test runs where reusing the distributed environment can cause pool worker cleanup to hang after tests complete.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
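A minimal sketch of what such an override might look like, for readers unfamiliar with the pattern. The function name `resolve_reuse_dist_env` and the set of accepted truthy values are assumptions made for illustration; the commit message does not say how the harness parses `DS_DISABLE_REUSE_DIST_ENV`.

```python
import os

# Assumed set of values that count as "disable"; the real harness may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_reuse_dist_env(requested, environ=None):
    """Return the effective reuse_dist_env setting for a test class.

    `requested` is the value the test class asks for. The environment
    variable can only turn reuse off, never on.
    """
    env = os.environ if environ is None else environ
    raw = env.get("DS_DISABLE_REUSE_DIST_ENV", "")
    disabled = raw.strip().lower() in _TRUTHY
    return bool(requested) and not disabled


if __name__ == "__main__":
    # With the variable unset, the class-level choice is respected.
    print(resolve_reuse_dist_env(True, environ={}))
    # With the variable set, reuse is forced off even if the class asks for it.
    print(resolve_reuse_dist_env(True, environ={"DS_DISABLE_REUSE_DIST_ENV": "1"}))
```

Making the variable a one-way switch (it can only disable reuse) means a CI job can opt out globally without changing behavior for tests that never asked for reuse.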
PKUWZP approved these changes on Jan 20, 2026
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request on Jan 29, 2026
We have had the full unit test workflow disabled for a while. This PR migrates the full test to our AWS test infra.

To make the tests pass, we need to merge these PRs:

- deepspeedai#7786
- deepspeedai#7788
- deepspeedai#7789
- deepspeedai#7790
- deepspeedai#7793
- deepspeedai#7794

In addition to having these PRs merged, this PR makes the following changes to the full test workflow and test harness:

- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which does CUDA graph capture checks that fail in forked processes (the tests run with the --forked flag; we can just move it to the sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>