Conversation
examples/llm_ptq/hf_ptq.py
Outdated
| "qwen3omni only supports one dataset for calibration, can extend this in the future" | ||
| ) | ||
| assert processor is not None, "The processor must be set for qwen3omni model." | ||
| dataset_name = args.dataset[0] if args.dataset else "scienceqa" |
do we still recommend scienceqa as the default calib dataset?
Changed this to cnn_dailymail
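A minimal sketch of the revised default, assuming the same argument shape as the snippet above (the helper name is hypothetical, not part of the PR):

```python
# Hypothetical helper showing the new fallback; `datasets` mirrors args.dataset.
def resolve_calib_dataset(datasets: list[str] | None) -> str:
    """Return the single calibration dataset, defaulting to cnn_dailymail."""
    return datasets[0] if datasets else "cnn_dailymail"


assert resolve_calib_dataset(None) == "cnn_dailymail"
assert resolve_calib_dataset(["scienceqa"]) == "scienceqa"
```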
examples/llm_ptq/hf_ptq.py
Outdated
    num_samples=args.calib_size[0],
)
elif model_type == "qwen3omni":
    assert len(args.calib_size) == 1, (
For this part, I think we may want to host it in a model-specific Python module, e.g. llm_ptq/models/qwen3omni.py.
@shengliangxu WDYT?
We do not need to do it for now; I'll come up with a full design doc and then we can convert the whole repo afterwards. Even if we separate things out now, we may still end up refactoring these anyway.
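For context, a rough sketch of what such a module could look like (the file layout and function name are assumptions, not part of this PR):

```python
# Hypothetical llm_ptq/models/qwen3omni.py: keeps Qwen3-Omni calibration quirks out of hf_ptq.py.
def validate_calib_args(datasets: list[str] | None, calib_sizes: list[int]) -> tuple[str, int]:
    """Qwen3-Omni currently supports exactly one calibration dataset and calib size."""
    assert len(calib_sizes) == 1, "qwen3omni only supports one calib_size for now"
    dataset_name = datasets[0] if datasets else "cnn_dailymail"
    return dataset_name, calib_sizes[0]
```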
examples/llm_ptq/hf_ptq.py
Outdated
# if args.verbose:
#     mtq.print_quant_summary(full_model)

import contextlib

@@ -283,7 +283,8 @@ def _get_free_gpu_mem():
    free_mem_before, max_allocated_before = _get_free_gpu_mem()
    is_enc_dec = model_type_is_enc_dec(model)
can we merge this into _model_requires_generate?
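One possible shape for that merge (a sketch only; the real helper may already exist with a different signature and additional checks):

```python
# Hypothetical fold of the enc-dec check into the existing predicate.
def _model_requires_generate(model) -> bool:
    """Return True when calibration should run model.generate() instead of a plain forward pass.

    Folding the encoder-decoder check in here removes the separate is_enc_dec
    flag at the call site.
    """
    cfg = getattr(model, "config", None)
    return bool(getattr(cfg, "is_encoder_decoder", False))
```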
self.tokenizer = tokenizer
# Handle invalid device values that can come from multi-GPU models with device_map="auto"
if device is None or str(device) in ("auto", "meta", "cpu"):
    device = "cuda"
Maybe print a warning?
And does this effectively mean: if "cuda" not in str(device): device = "cuda"?
I have removed this
examples/llm_ptq/hf_ptq.py
Outdated
model_is_already_quantized = is_quantized(model)

model_type = get_model_type(model)
if model_type == "qwen3omni" and os.environ.get("DISABLE_TALKER", "0") == "1":
I think we probably need to find a better way for configurations like this
I have disabled the talker quantization by default
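A sketch of how that default could be expressed through the quant config rather than an environment variable; the "*talker*" wildcard is an assumption about Qwen3-Omni's module names:

```python
import copy

import modelopt.torch.quantization as mtq

# Disable the talker via the quant config instead of a DISABLE_TALKER env var.
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["quant_cfg"]["*talker*"] = {"enable": False}  # pattern name is assumed
```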
Comment out import and registration of Qwen3OmniMoe classes. Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
| quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False} | ||
| quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False} | ||
|
|
||
| if model_type == "qwen3omni": |
I feel this level of qformat is too detailed. Can you recommend one and use it for Qwen3 Omni?
The basic nvfp4 format works fine; we can use that for now. I will add these formats to a separate document for later reference.
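Sketch of the agreed direction, using the stock NVFP4 config; `model` and `forward_loop` are assumed to come from the surrounding hf_ptq.py flow:

```python
import modelopt.torch.quantization as mtq


def quantize_qwen3omni_nvfp4(model, forward_loop):
    # Plain NVFP4 with the default config; no Qwen3-Omni-specific qformat needed.
    return mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```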
# See the License for the specific language governing permissions and
# limitations under the License.

"""Script to pre-generate processed video dataset for Qwen3-Omni quantization."""
is this generation script qwen3_omni specific?
Yes, I don't think we need to merge this into our codebase. Will document this separately.
| "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG, | ||
| "nvfp4_svdquant": mtq.NVFP4_SVDQUANT_DEFAULT_CFG, | ||
| "mxfp8": mtq.MXFP8_DEFAULT_CFG, | ||
| "qwen3_nvfp4_qkv_disabled": mtq.NVFP4_DEFAULT_CFG, |
If possible, I would recommend we don't introduce these qformats.
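One way to avoid the extra qformat entry (a sketch): apply the qkv overrides on top of the default NVFP4 config at the call site, mirroring the wildcard overrides shown earlier in this review.

```python
import copy

import modelopt.torch.quantization as mtq

# Reuse the existing "nvfp4" entry and disable the attention quantizers in place,
# so no "qwen3_nvfp4_qkv_disabled" qformat has to be added to the table above.
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
```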
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Script to load and run a quantized Qwen3Omni model from export_hf_checkpoint."""
qq: why do we need this example?
| print(f" Copied {fname}") | ||
|
|
||
|
|
||
| def main(): |
do we need this file? Does vllm serve work?
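For reference, a quick offline smoke test with vLLM's Python API (the checkpoint path is a placeholder; whether vLLM picks up the ModelOpt quantization from the exported config automatically is an assumption to verify):

```python
from vllm import LLM, SamplingParams

# Placeholder path; point this at the exported checkpoint directory.
llm = LLM(model="/path/to/exported_qwen3omni_checkpoint")
outputs = llm.generate(["Describe the scene in one sentence."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```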
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| language_model(fake_input) | ||
| with set_quantizer_by_cfg_context(model, {"*": {"enable": False}}): |
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| elif "qwen3omni" in model_type: |
can we update get_language_model_from_vl to cover the following logic inside it?
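A sketch of the suggested consolidation (the real get_language_model_from_vl may have a different signature; the "thinker" attribute path is an assumption about Qwen3-Omni's layout):

```python
# Hypothetical resolver: pick the text backbone for multimodal models in one place,
# including Qwen3-Omni's thinker, so callers avoid per-model_type branches.
def _resolve_language_model(model):
    for attr in ("language_model", "thinker", "text_model"):
        sub = getattr(model, attr, None)
        if sub is not None:
            return sub
    return model
```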
    sub_module, quantizer_attrs.weight_quantizer
)

# Skip export if weight quantizer is disabled or has no amax (not calibrated)
qq: what error will we see if this logic is not added?
This won't be required.
dtype: The data type for weight conversion.
is_modelopt_qlora: Whether the model is a modelopt-trained QLoRA model.
    If True, modules with base_layer attribute are skipped.
pack_weights: Whether to pack quantized weights.
why do we need this flag?
I was initially trying to export the checkpoint without packing the weights, but this won't be required, since vLLM also expects the model to have packed weights.
What does this PR do?
Type of change: Model support
Overview:
Usage
Testing: Able to quantize model and generate output
Before your PR is "Ready for review"