Conversation
examples/llm_ptq/hf_ptq.py
Outdated
| "qwen3omni only supports one dataset for calibration, can extend this in the future" | ||
| ) | ||
| assert processor is not None, "The processor must be set for qwen3omni model." | ||
| dataset_name = args.dataset[0] if args.dataset else "scienceqa" |
do we still recommend scienceqa as the default calib dataset?
Changed this to cnn_dailymail
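A minimal sketch of the revised default, assuming the same argument shape as the snippet above (the helper name is hypothetical, not part of the PR):

```python
# Hypothetical helper showing the new fallback; `datasets` mirrors args.dataset.
def resolve_calib_dataset(datasets: list[str] | None) -> str:
    """Return the single calibration dataset, defaulting to cnn_dailymail."""
    return datasets[0] if datasets else "cnn_dailymail"


assert resolve_calib_dataset(None) == "cnn_dailymail"
assert resolve_calib_dataset(["scienceqa"]) == "scienceqa"
```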
examples/llm_ptq/hf_ptq.py
Outdated
    num_samples=args.calib_size[0],
)
elif model_type == "qwen3omni":
    assert len(args.calib_size) == 1, (
For this part, I think we may want to host it in a model-specific Python module, e.g. llm_ptq/models/qwen3omni.py.
@shengliangxu WDYT?
We do not need to do it for now; I'll come up with a full design doc and then we can convert the whole repo afterwards. Even if we separate things out now, we may still end up refactoring these anyway.
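For context, a rough sketch of what such a module could look like (the file layout and function name are assumptions, not part of this PR):

```python
# Hypothetical llm_ptq/models/qwen3omni.py: keeps Qwen3-Omni calibration quirks out of hf_ptq.py.
def validate_calib_args(datasets: list[str] | None, calib_sizes: list[int]) -> tuple[str, int]:
    """Qwen3-Omni currently supports exactly one calibration dataset and calib size."""
    assert len(calib_sizes) == 1, "qwen3omni only supports one calib_size for now"
    dataset_name = datasets[0] if datasets else "cnn_dailymail"
    return dataset_name, calib_sizes[0]
```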
examples/llm_ptq/hf_ptq.py
Outdated
# if args.verbose:
#     mtq.print_quant_summary(full_model)

import contextlib

@@ -283,7 +283,8 @@ def _get_free_gpu_mem():
    free_mem_before, max_allocated_before = _get_free_gpu_mem()
    is_enc_dec = model_type_is_enc_dec(model)
can we merge this into _model_requires_generate?
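One possible shape for that merge (a sketch only; the real helper may already exist with a different signature and additional checks):

```python
# Hypothetical fold of the enc-dec check into the existing predicate.
def _model_requires_generate(model) -> bool:
    """Return True when calibration should run model.generate() instead of a plain forward pass.

    Folding the encoder-decoder check in here removes the separate is_enc_dec
    flag at the call site.
    """
    cfg = getattr(model, "config", None)
    return bool(getattr(cfg, "is_encoder_decoder", False))
```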
self.tokenizer = tokenizer
# Handle invalid device values that can come from multi-GPU models with device_map="auto"
if device is None or str(device) in ("auto", "meta", "cpu"):
    device = "cuda"
Maybe print a warning?
And does this effectively mean: if "cuda" not in str(device): device = "cuda"?
I have removed this
examples/llm_ptq/hf_ptq.py
Outdated
model_is_already_quantized = is_quantized(model)

model_type = get_model_type(model)
if model_type == "qwen3omni" and os.environ.get("DISABLE_TALKER", "0") == "1":
I think we probably need to find a better way for configurations like this
I have disabled the talker quantization by default
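A sketch of how that default could be expressed through the quant config rather than an environment variable; the "*talker*" wildcard is an assumption about Qwen3-Omni's module names:

```python
import copy

import modelopt.torch.quantization as mtq

# Disable the talker via the quant config instead of a DISABLE_TALKER env var.
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["quant_cfg"]["*talker*"] = {"enable": False}  # pattern name is assumed
```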
Comment out import and registration of Qwen3OmniMoe classes. Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
| quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False} | ||
| quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False} | ||
|
|
||
| if model_type == "qwen3omni": |
I feel this level of qformat is too detailed. Can you recommend one and use it for Qwen3 Omni?
The basic nvfp4 format works fine; we can use that for now. I will add these formats to a separate document for later reference.
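Sketch of the agreed direction, using the stock NVFP4 config; `model` and `forward_loop` are assumed to come from the surrounding hf_ptq.py flow:

```python
import modelopt.torch.quantization as mtq


def quantize_qwen3omni_nvfp4(model, forward_loop):
    # Plain NVFP4 with the default config; no Qwen3-Omni-specific qformat needed.
    return mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```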
# See the License for the specific language governing permissions and
# limitations under the License.

"""Script to pre-generate processed video dataset for Qwen3-Omni quantization."""
is this generation script qwen3_omni specific?
Yes, I don't think we need to merge this into our codebase. Will document this separately.
| "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG, | ||
| "nvfp4_svdquant": mtq.NVFP4_SVDQUANT_DEFAULT_CFG, | ||
| "mxfp8": mtq.MXFP8_DEFAULT_CFG, | ||
| "qwen3_nvfp4_qkv_disabled": mtq.NVFP4_DEFAULT_CFG, |
If possible, I would recommend we don't introduce these qformats.
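One way to avoid the extra qformat entry (a sketch): apply the qkv overrides on top of the default NVFP4 config at the call site, mirroring the wildcard overrides shown earlier in this review.

```python
import copy

import modelopt.torch.quantization as mtq

# Reuse the existing "nvfp4" entry and disable the attention quantizers in place,
# so no "qwen3_nvfp4_qkv_disabled" qformat has to be added to the table above.
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
```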
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Script to load and run a quantized Qwen3Omni model from export_hf_checkpoint."""
qq: why do we need this example?
| print(f" Copied {fname}") | ||
|
|
||
|
|
||
| def main(): |
do we need this file? Does vllm serve work?
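For reference, a quick offline smoke test with vLLM's Python API (the checkpoint path is a placeholder; whether vLLM picks up the ModelOpt quantization from the exported config automatically is an assumption to verify):

```python
from vllm import LLM, SamplingParams

# Placeholder path; point this at the exported checkpoint directory.
llm = LLM(model="/path/to/exported_qwen3omni_checkpoint")
outputs = llm.generate(["Describe the scene in one sentence."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```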
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| language_model(fake_input) | ||
| with set_quantizer_by_cfg_context(model, {"*": {"enable": False}}): |
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| elif "qwen3omni" in model_type: |
can we update get_language_model_from_vl to cover the following logic inside it?
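A sketch of the suggested consolidation (the real get_language_model_from_vl may have a different signature; the "thinker" attribute path is an assumption about Qwen3-Omni's layout):

```python
# Hypothetical resolver: pick the text backbone for multimodal models in one place,
# including Qwen3-Omni's thinker, so callers avoid per-model_type branches.
def _resolve_language_model(model):
    for attr in ("language_model", "thinker", "text_model"):
        sub = getattr(model, attr, None)
        if sub is not None:
            return sub
    return model
```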
    sub_module, quantizer_attrs.weight_quantizer
)

# Skip export if weight quantizer is disabled or has no amax (not calibrated)
qq: what error will we see if this logic is not added?
This won't be required.
dtype: The data type for weight conversion.
is_modelopt_qlora: Whether the model is a modelopt-trained QLoRA model.
    If True, modules with base_layer attribute are skipped.
pack_weights: Whether to pack quantized weights.
why do we need this flag?
I was initially trying to export the checkpoint without packing the weights, but this won't be required, since vLLM also expects the model to have packed weights.
What does this PR do?
Type of change: Model support
Overview:
Usage
Testing: Able to quantize model and generate output
Before your PR is "Ready for review"