docs: site skeleton & initial placeholder content #62
Open
lbliii wants to merge 26 commits into main from llane/site-config-and-skeleton
Signed-off-by: Lawrence Lane <[email protected]>
…language from training paradigms
- Remove specific node count limits (16 nodes) lacking code evidence
- Change 'Unlimited nodes' to 'Large multi-node clusters' for accuracy
- Replace 'webdataset format' with 'Energon data loader' (verified in code)
- Remove subjective time estimates (minutes/hours) from setup complexity
- Improve precision of scalability descriptions throughout
Signed-off-by: Lawrence Lane <[email protected]>
- Add complete fsdp config examples showing all 4 parallelism dimensions
- Replace specific bandwidth numbers with general high-bandwidth requirement
- Clarify pipeline bubble efficiency without unverified percentages
- Remove unverified 2× memory claim for optimizer state sharding
- Add runtime verification examples for checking parallelism config
- Add note about automatic DP calculation in automodel
- Improve DP calculation example with concrete numbers
Signed-off-by: Lawrence Lane <[email protected]>
… structure, progressive disclosure, and clearer examples Signed-off-by: Lawrence Lane <[email protected]>
…bleshooting detail
- Change content_type from tutorial to how-to (correct classification)
- Improve progressive disclosure with clearer step labels
- Add verified configuration parameters from source code
- Enhance troubleshooting with specific symptoms and actionable solutions
- Add checkpoint structure details and contents
- Improve configuration override explanation with three-layer precedence
- Add missing checkpoint configuration options
- Fix list spacing for markdown lint compliance
Signed-off-by: Lawrence Lane <[email protected]>
…l vs megatron); add automodel track (training + inference); add megatron track (data prep + training + inference); update index to route users by use case Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* init Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add sigma_min/amx Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add sigma_min/max Signed-off-by: Alexandros Koumparoulis <[email protected]>
* rename fientune.py to train.py Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add from_config Signed-off-by: Alexandros Koumparoulis <[email protected]>
* pass scheduler and model Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update param Signed-off-by: Alexandros Koumparoulis <[email protected]>
* introduce NeMoWanPipeline Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add mode Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update build_model_and_optimizer Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update NeMoWanPipeline Signed-off-by: Alexandros Koumparoulis <[email protected]>
* rename Signed-off-by: Alexandros Koumparoulis <[email protected]>
* move examples Signed-off-by: Alexandros Koumparoulis <[email protected]>
* move Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix imports Signed-off-by: Alexandros Koumparoulis <[email protected]>
* lint Signed-off-by: Alexandros Koumparoulis <[email protected]>
* more lint Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix import Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix 3rdparty & pyproject Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add torch Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update uv.lock Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* revert 3rdparty Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update uv.lock Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
* update uv.lock Signed-off-by: Alexandros Koumparoulis <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Pablo Garay <[email protected]> Signed-off-by: Lawrence Lane <[email protected]>
* add tests Signed-off-by: linnan wang <[email protected]>
* update test Signed-off-by: linnan wang <[email protected]>
* update Signed-off-by: linnan wang <[email protected]>
* update Signed-off-by: linnan wang <[email protected]>
---------
Signed-off-by: linnan wang <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* adding tests
* ruff lint
* ruff lint
* ruff lint
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit Signed-off-by: Pablo Garay <[email protected]>
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68) Signed-off-by: Pablo Garay <[email protected]>
* Add Mcore WAN pretrain mock test to CI/CD Signed-off-by: Pablo Garay <[email protected]>
* lintfix Signed-off-by: Pablo Garay <[email protected]>
* Fix slow Docker build from Megatron-LM source Signed-off-by: Pablo Garay <[email protected]>
* ci: Update gpu runners to use self-hosted-nemo (#48)
* ci: Update gpu runners to use self-hosted-nemo Signed-off-by: Charlie Truong <[email protected]>
* Use uv run in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <[email protected]>
* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain Signed-off-by: Charlie Truong <[email protected]>
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <[email protected]>
* Revert GHA changes Signed-off-by: Charlie Truong <[email protected]>
* Move uv run group call to L2_Mcore_Mock_Tests_GPU Signed-off-by: Charlie Truong <[email protected]>
* Set test back to 5 minute timeout Signed-off-by: Charlie Truong <[email protected]>
* Megatron fixes (#49)
* Enhance DiT and Wan layer specifications
  - Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
  - Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.
* Implement ProcessGroupCollection initialization in DiT and Wan models
  - Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
  - This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.
* Refactor code style in DiT and Wan models
  - Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
  - Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.
* Revert M4 changes
* Ruff
* Ruff
* Lint
---------
Co-authored-by: Abhinav Garg <[email protected]>
* Revert "Revert GHA changes" This reverts commit d7ad1ab.
* tempfortest: timeout setting Signed-off-by: Pablo Garay <[email protected]>
* workflow dispatch Signed-off-by: Pablo Garay <[email protected]>
* update Signed-off-by: Pablo Garay <[email protected]>
* add logging Signed-off-by: Pablo Garay <[email protected]>
* Update test configuration for Mcore WAN pretraining
  - Increased the number of processes per node from 1 to 2 for distributed training.
  - Set the number of training iterations to 10 to enhance the training process.
* More changes
* Lint
---------
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
* Reapply "Revert GHA changes" This reverts commit fdb911f.
Signed-off-by: Pablo Garay <[email protected]>
* update path per request Signed-off-by: Pablo Garay <[email protected]>
* lintfix Signed-off-by: Pablo Garay <[email protected]>
* update CONTRIBUTING.md Signed-off-by: Pablo Garay <[email protected]>
* lintfix Signed-off-by: Pablo Garay <[email protected]>
* adding v run --group megatron-bridge
* update test
* ruff lint
* restore Dockerfile.ci
* update .github/workflows/cicd-main.yml
---------
Signed-off-by: Pablo Garay <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* introduce step_scheduler section Signed-off-by: Alexandros Koumparoulis <[email protected]>
* add step_scheduler section Signed-off-by: Alexandros Koumparoulis <[email protected]>
* lint Signed-off-by: Alexandros Koumparoulis <[email protected]>
* rm dead code Signed-off-by: Alexandros Koumparoulis <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* replace torch.stack with torch.cat Signed-off-by: Alexandros Koumparoulis <[email protected]>
* fix Signed-off-by: Alexandros Koumparoulis <[email protected]>
---------
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
0ffe23d to f2c7c94
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
This PR provides:
Note: The content staged here can be used either as a starting point or as a reference to be deleted later. I tried to make the content realistic, but ultimately these articles need SME input and direction. Future sections are yet to be determined.
To preview docs: