Add Quantizers for Qwen3VLMoeTextDecoderLayer #666
Conversation
Force-pushed from 2c18fd0 to 7a16307
Force-pushed from f5c78c9 to ff666d5
Signed-off-by: Qidong Su <qidongs@nvidia.com> Signed-off-by: Qidong Su <soodoshll@gmail.com>
Force-pushed from ff666d5 to 37c24f4
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI. You can disable this status message by adjusting the settings.

📝 Walkthrough

A new quantization wrapper class, `_QuantQwen3VLMoeTextExperts`, rewrites the monolithic Qwen3VL MoE experts module as per-expert `nn.Linear` layers so modelopt's quantizers can recognize them, and the new class is registered for the Qwen3VL MoE model when the installed transformers version provides it.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 630-640: expert_idx is a 1-element tensor and is being used to
index nn.ModuleList (self.gate_proj/self.up_proj/self.down_proj) and
routing_weights, which raises a TypeError; convert expert_idx to a Python int
(e.g., expert_idx_int = int(expert_idx.item() or expert_idx[0].item()) ) before
using it to index the ModuleList entries and use that int for routing_weights
indexing (routing_weights[token_idx, expert_idx_int, None]) while keeping
token_idx as a tensor for per-token selection, then proceed with
gate_proj/up_proj/down_proj using the integer index and next_states.index_add_
as before.
- Around line 574-616: In _QuantQwen3VLMoeTextExperts._setup the code references
a non-existent attribute self.expert_dim when slicing self.gate_up_proj, which
will raise AttributeError; update the slices to use the correct per-expert size
attribute self.moe_intermediate_size (i.e. replace uses of self.expert_dim in
the gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight
calls for gate_proj and up_proj use the proper slice, leaving down_proj and
attribute reassignment (gate_up_proj → gate_proj, down_proj → down_proj)
unchanged.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
735-755: LGTM! The registration blocks follow the established pattern used throughout this file for conditional module registration. The `try/except ImportError` guards appropriately handle cases where the transformers version doesn't include the Qwen3VL MoE model.
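For readers skimming the diff, the conditional-registration pattern referred to here looks roughly like the sketch below; the `QuantModuleRegistry.register` call shape and the registry-key string are assumptions based on the surrounding plugin code, not quoted from this PR.

```python
# Sketch only: QuantModuleRegistry and _QuantQwen3VLMoeTextExperts are already
# in scope inside modelopt/torch/quantization/plugins/huggingface.py.
try:
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts

    # Only register the quantized wrapper when the installed transformers
    # release actually ships the Qwen3VL MoE model.
    QuantModuleRegistry.register({Qwen3VLMoeTextExperts: "hf.Qwen3VLMoeTextExperts"})(
        _QuantQwen3VLMoeTextExperts
    )
except ImportError:
    # Older transformers versions do not include Qwen3VL MoE; skip silently.
    pass
```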
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
class _QuantQwen3VLMoeTextExperts(QuantModule):
    def _setup(self):
        """Modify the Qwen3VLMoeTextExperts by using nn.Linear layers."""
        from accelerate import init_empty_weights

        dtype, device = self.gate_up_proj.dtype, self.gate_up_proj.device

        def _copy_weight(module, weight):
            module.to_empty(device=device)
            with torch.no_grad():
                module.weight.data = weight.detach().data.to(dtype=dtype, device=device)

        with init_empty_weights():
            gate_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            up_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            down_proj = nn.ModuleList(
                [
                    nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )

        for idx in range(self.num_experts):
            _copy_weight(gate_proj[idx], self.gate_up_proj[idx, :, : self.expert_dim].T)
            _copy_weight(up_proj[idx], self.gate_up_proj[idx, :, self.expert_dim :].T)
            _copy_weight(down_proj[idx], self.down_proj[idx, :].T)

        delattr(self, "gate_up_proj")
        delattr(self, "down_proj")
        self.gate_proj = gate_proj
        self.up_proj = up_proj
        self.down_proj = down_proj
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First check if transformers is available and inspect Qwen3VLMoeTextExperts
python3 << 'EOF'
import sys
try:
    # Try to import the class
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts
    import inspect

    print("=== Qwen3VLMoeTextExperts Class Found ===")
    # Get source code
    try:
        source = inspect.getsource(Qwen3VLMoeTextExperts)
        # Check for expert_dim
        if 'expert_dim' in source:
            print("✓ expert_dim FOUND in class source")
        else:
            print("✗ expert_dim NOT found in class source")
        # Check for intermediate_size
        if 'intermediate_size' in source:
            print("✓ intermediate_size found in class source")
        # Check __init__ signature
        init_sig = inspect.signature(Qwen3VLMoeTextExperts.__init__)
        print(f"__init__ parameters: {list(init_sig.parameters.keys())}")
    except Exception as e:
        print(f"Could not get source: {e}")
except ImportError as e:
    print(f"✗ Could not import Qwen3VLMoeTextExperts: {e}")
    print("\nTrying alternative import paths...")
    try:
        from transformers import AutoModel
        print("AutoModel available, but Qwen3VLMoeTextExperts not directly importable")
    except:
        print("Transformers not fully available")
EOF

Repository: NVIDIA/Model-Optimizer
Length of output: 152
🏁 Script executed:
#!/bin/bash
# Also search the reviewed file for any expert_dim references
echo "=== Searching reviewed file for expert_dim references ==="
if [ -f "modelopt/torch/quantization/plugins/huggingface.py" ]; then
    rg "expert_dim" modelopt/torch/quantization/plugins/huggingface.py -n -C 2
else
    echo "File not found at expected location"
fi

Repository: NVIDIA/Model-Optimizer
Length of output: 436
🌐 Web query:
Qwen3VLMoeTextExperts expert_dim attribute huggingface transformers
💡 Result:
Short answer: Hugging Face's Qwen3/VL-MoE configs do not expose an attribute named expert_dim. For MoE in transformers you should use:
- num_experts — number of experts. [1][2]
- num_experts_per_tok (top‑k) — number of experts routed per token. [1][2]
- moe_intermediate_size — the hidden (intermediate) size used inside each routed expert (what some codebases call an expert hidden-dimension). [1][3]
Evidence:
- Qwen3MoE / Qwen3VLMoeConfig docs list num_experts, num_experts_per_tok and moe_intermediate_size but do not define expert_dim. [1][2][3]
- Other libraries sometimes use the name expert_dim (e.g., MindSpore examples) to mean either the number of experts or the per‑expert dimension depending on their API — so naming differs between frameworks. [4][5]
If you expected an expert_dim field in transformers, use moe_intermediate_size for per-expert hidden size and num_experts for expert count. References: Hugging Face transformers Qwen3/Qwen3-VL-MoE docs and MindSpore MoE docs. [1][2][3][4]
Sources
[1] Hugging Face — Qwen3MoE config docs.
[2] Hugging Face — Qwen3-VL-Moe config docs.
[3] Hugging Face (alternate version) — Qwen3-VL-Moe parameters.
[4] MindSpore MoE implementation (expert_dim usage).
[5] MindSpore MoE (other release).
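For a quick local sanity check of the attribute names reported above, something like the following works (a sketch; the config class path follows the transformers naming convention for this model and may differ between versions):

```python
# Instantiating the text config with defaults is enough to inspect field names;
# no weights are downloaded.
from transformers.models.qwen3_vl_moe.configuration_qwen3_vl_moe import Qwen3VLMoeTextConfig

cfg = Qwen3VLMoeTextConfig()
print(cfg.num_experts, cfg.num_experts_per_tok, cfg.moe_intermediate_size)
print(hasattr(cfg, "expert_dim"))  # expected: False
```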
Remove undefined expert_dim attribute reference — this will cause AttributeError at runtime.
The expert_dim attribute does not exist on HuggingFace's Qwen3VLMoeTextExperts class. Lines 607–608 access self.expert_dim, which will fail at runtime. The correct attribute for the per-expert intermediate size is moe_intermediate_size. Replace:
    self.gate_up_proj[idx, :, : self.expert_dim].T
    self.gate_up_proj[idx, :, self.expert_dim :].T

with:

    self.gate_up_proj[idx, :, : self.moe_intermediate_size].T
    self.gate_up_proj[idx, :, self.moe_intermediate_size :].T

🤖 Prompt for AI Agents
self.gate_up_proj[idx, :, self.moe_intermediate_size :].T🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 574 - 616,
In _QuantQwen3VLMoeTextExperts._setup the code references a non-existent
attribute self.expert_dim when slicing self.gate_up_proj, which will raise
AttributeError; update the slices to use the correct per-expert size attribute
self.moe_intermediate_size (i.e. replace uses of self.expert_dim in the
gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight calls
for gate_proj and up_proj use the proper slice, leaving down_proj and attribute
reassignment (gate_up_proj → gate_proj, down_proj → down_proj) unchanged.
for expert_idx in expert_hit[:]:
    assert expert_idx.numel() == 1, expert_idx
    with torch.no_grad():
        _, token_idx = torch.where(expert_mask[expert_idx[0]])
    current_state = hidden_states[token_idx]
    gate = self.gate_proj[expert_idx](current_state)
    up = self.up_proj[expert_idx](current_state)
    gated_output = up * self.act_fn(gate)
    out = self.down_proj[expert_idx](gated_output)
    weighted_output = out * routing_weights[token_idx, expert_idx, None]
    next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
Bug: Using tensor to index nn.ModuleList will fail.
When iterating over expert_hit (from nonzero()), each expert_idx is a 1D tensor of shape (1,), not a Python integer. nn.ModuleList.__getitem__ expects an integer index, so self.gate_proj[expert_idx] will raise a TypeError.
Convert expert_idx to an integer before indexing:
🐛 Proposed fix
    - for expert_idx in expert_hit[:]:
    -     assert expert_idx.numel() == 1, expert_idx
    + for expert_idx in expert_hit:
    +     expert_idx = expert_idx.item()
          with torch.no_grad():
    -         _, token_idx = torch.where(expert_mask[expert_idx[0]])
    +         _, token_idx = torch.where(expert_mask[expert_idx])
          current_state = hidden_states[token_idx]
          gate = self.gate_proj[expert_idx](current_state)
          up = self.up_proj[expert_idx](current_state)
          gated_output = up * self.act_fn(gate)
          out = self.down_proj[expert_idx](gated_output)
    -     weighted_output = out * routing_weights[token_idx, expert_idx, None]
    +     weighted_output = out * routing_weights[token_idx, expert_idx:expert_idx+1, None]
          next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

Note: The routing_weights[token_idx, expert_idx, None] indexing at line 639 may also need adjustment after converting to int, depending on the expected tensor shape.
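Independent of the indexing question, the dispatch-and-accumulate pattern in this loop can be exercised in isolation; the sketch below uses toy shapes, plain Python `int` expert indices, and an assumed `expert_mask` layout of `[num_experts, top_k, num_tokens]` (all names and sizes are illustrative, not taken from the model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, hidden, num_experts, top_k = 6, 4, 3, 2
hidden_states = torch.randn(num_tokens, hidden)
routing_weights = torch.rand(num_tokens, num_experts)  # already scattered to full width
selected = torch.randint(num_experts, (num_tokens, top_k))
expert_mask = F.one_hot(selected, num_experts).permute(2, 1, 0)  # [num_experts, top_k, num_tokens]

experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_experts)])
next_states = torch.zeros_like(hidden_states)

for expert_idx in range(num_experts):
    _, token_idx = torch.where(expert_mask[expert_idx])
    if token_idx.numel() == 0:
        continue
    out = experts[expert_idx](hidden_states[token_idx])
    weighted = out * routing_weights[token_idx, expert_idx, None]
    # index_add_ accumulates, so tokens routed to several experts sum their contributions.
    next_states.index_add_(0, token_idx, weighted)
```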
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 630 - 640,
expert_idx is a 1-element tensor and is being used to index nn.ModuleList
(self.gate_proj/self.up_proj/self.down_proj) and routing_weights, which raises a
TypeError; convert expert_idx to a Python int (e.g., expert_idx_int =
int(expert_idx.item() or expert_idx[0].item()) ) before using it to index the
ModuleList entries and use that int for routing_weights indexing
(routing_weights[token_idx, expert_idx_int, None]) while keeping token_idx as a
tensor for per-token selection, then proceed with gate_proj/up_proj/down_proj
using the integer index and next_states.index_add_ as before.
Edwardf0t1
left a comment
LGTM, left a few minor comments.
Co-authored-by: Zhiyu <bestczy317@gmail.com> Signed-off-by: Qidong Su <soodoshll@gmail.com>
Victor49152
left a comment
The new change to the attribute naming has been verified; it should be good to go.
What does this PR do?
Type of change: New feature
Overview: The huggingface transformers library implements the Qwen3VL MoE layer as a monolithic module instead of assembling it from Linear layers, so modelopt's quantizer currently cannot recognize it. This PR introduces a conversion from hf's qwen3vl_moe MoE layers to qwen3_moe-style MoE layers that consist of a set of Linear layers.
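For context on how the converted layers are then consumed, a rough end-to-end sketch of modelopt's PTQ flow is shown below (a toy stand-in model and a random-data calibration loop are used purely for illustration; `mtq.quantize` and `mtq.INT8_DEFAULT_CFG` follow modelopt's documented quantization API, but verify the exact config names against your installed version).

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-in; in practice this would be the HF Qwen3-VL MoE text model, whose
# expert layers this PR rewrites into per-expert nn.Linear modules.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

def forward_loop(m):
    # Minimal calibration pass with random inputs; real usage feeds representative batches.
    for _ in range(4):
        m(torch.randn(2, 16))

# mtq.quantize swaps supported modules (nn.Linear, and with this PR the Qwen3VL
# MoE experts) for quantized wrappers, then runs the calibration loop.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```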
Testing
Tested with
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.