Support KIMI K2 Thinking int4 checkpoint PTQ #669
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Edwardf0t1 left a comment
Is the generated ckpt identical to @jingyu-ml's previously generated nvfp4 ckpt?
modelopt/torch/quantization/plugins/huggingface.py

```python
    pass


try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear
```
Should we add compressed-tensors as an optional dependency?
@kevalmorabia97 @realAsma what do you think?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed? What benefit would we get from adding it as an optional dependency?
compressed-tensors's main dependencies are torch and transformers, so it should be pretty lightweight to add as a dependency; fine if you want to add it. But if it's not commonly used by customers, perhaps we can skip it.
Can we move this to a separate file, modelopt/torch/quantization/plugins/compressed_tensor.py?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed?
This is a good point. +1
Are we planning to have any unit tests for compressed tensor integration?
Can we move this to a separate file modelopt/torch/quantization/plugins/compressed_tensor.py?

How strongly do you feel about it? Right now I feel this still falls under the HF plugins, since it is part of the HF invocation path.
@cjluo-nv Did we run any deployment and accuracy tests for the ckpt generated with this flow to make sure it's correct? Asking because there's a customer who wants to generate the ckpt themselves. In addition, I heard from @jingyu-ml that we need to modify modeling_deepseek.py to enable our PTQ flow.
With transformers 4.48, we don't need to modify the original model. I have not had a chance to validate the checkpoint yet; will probably continue after the Christmas break. @Edwardf0t1 is this an urgent request?
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Hello @cjluo-nv. Please specify exactly which transformers version you have tested this with. I've just used 4.48.0 and quickly hit some import issues (for
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 Walkthrough

Changes extend quantization support for HuggingFace models with a new QuantCompressedLinear module, adjust model loading for the pack-quantized format, disable deepseek-specific MLA quantization, refine export/cleanup procedures, and establish bfloat16 as the default precision instead of float16.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
5328ae9 to 66ac7a6 (force-pushed)
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
unpack_weight(605-611)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
examples/llm_ptq/hf_ptq.py (1)
examples/llm_ptq/example_utils.py (1)
run_nemotron_vl_preview(49-93)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (8)
modelopt/torch/export/layer_utils.py (1)

346-349: LGTM! The updated `is_quantlinear` logic correctly extends detection to include `QuantCompressedLinear` while maintaining the existing `QuantLinear` detection and `lora` exclusion. The implementation aligns with the new `_QuantCompressedLinear` class added in the quantization plugins.
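As a rough illustration, a name-based check of this kind could look like the sketch below (illustrative only; the actual `is_quantlinear` helper in `layer_utils.py` may be implemented differently):

```python
def is_quantlinear(module) -> bool:
    """Sketch: treat QuantLinear / QuantCompressedLinear wrappers as quantized
    linear layers, while excluding LoRA adapter modules."""
    class_name = type(module).__name__
    looks_quantized = "QuantLinear" in class_name or "QuantCompressedLinear" in class_name
    return looks_quantized and "lora" not in class_name.lower()
```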
modelopt/torch/quantization/plugins/huggingface.py (1)

689-697: LGTM! The registration follows the established pattern used for other optional HuggingFace model types (Llama4, Dbrx, Mixtral, etc.), with proper ImportError handling for the optional `compressed-tensors` dependency.
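The optional-dependency registration pattern being referenced generally looks like the sketch below; the registry dict and the `_QuantCompressedLinear` placeholder are illustrative stand-ins, not the actual modelopt symbols:

```python
# Map from original module classes to their quantized replacements (a stand-in
# for modelopt's real registration mechanism).
QUANT_MODULE_REGISTRY: dict[type, type] = {}

try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear

    class _QuantCompressedLinear:
        """Placeholder for the quantized wrapper around CompressedLinear."""

    QUANT_MODULE_REGISTRY[CompressedLinear] = _QuantCompressedLinear
except ImportError:
    # compressed-tensors is optional; quietly skip registration when absent.
    pass
```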
modelopt/torch/export/quant_utils.py (1)

884-890: LGTM! Adding `"weight_shape"` to the skip_keys set correctly filters out metadata attributes from CompressedLinear modules that shouldn't be included in the exported state dict. This aligns with the new `_QuantCompressedLinear` support.
modelopt/torch/export/unified_export_hf.py (3)

395-396: LGTM with a note on performance. The explicit CUDA cache clearing helps prevent OOM during export of large models. Note that calling `empty_cache()` after every layer can add some overhead for models with many layers, but the memory-safety benefit likely outweighs this for the target use case (KIMI K2, which is a large MoE model).
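As a rough picture of the per-layer cleanup being discussed (the loop and the `process_layer` callback are hypothetical, not the actual export code):

```python
import torch

def export_layer_by_layer(layers, process_layer):
    """Export layers one at a time, releasing cached GPU blocks in between to
    keep peak memory low on very large MoE checkpoints."""
    for layer in layers:
        process_layer(layer)       # hypothetical per-layer export/quantization work
        torch.cuda.empty_cache()   # small per-layer overhead, traded for OOM safety
```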
523-525: LGTM! The weight unpacking step correctly prepares CompressedLinear modules for export by decompressing packed weights before quantization processing. The `hasattr` guard ensures this only applies to modules with the `weight_packed` attribute.
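The guarded unpacking pass can be pictured roughly like this (a sketch based on the comment above; the real logic in `unified_export_hf.py` may be structured differently):

```python
def unpack_compressed_weights(model):
    """Decompress packed int4 weights on any module that still carries a
    `weight_packed` buffer, so the exporter only sees plain dense weights."""
    for module in model.modules():
        if hasattr(module, "weight_packed") and hasattr(module, "unpack_weight"):
            module.unpack_weight()
```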
597-600: LGTM! Removing the `hf_quantizer` attribute before calling `save_pretrained` ensures that HuggingFace's serialization doesn't interfere with the custom quantized state dict export. The `getattr` with default `None` safely handles models without this attribute.
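A sketch of the cleanup step described above (the attribute and method names come from the comment; the surrounding export logic is omitted):

```python
def save_quantized_checkpoint(model, export_dir: str):
    """Drop HF's quantizer hook before serialization so `save_pretrained` does
    not try to re-process the already-exported quantized state dict."""
    if getattr(model, "hf_quantizer", None) is not None:
        del model.hf_quantizer
    model.save_pretrained(export_dir)
```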
examples/llm_ptq/scripts/huggingface_example.sh (1)

56-58: LGTM! The `nvfp4_mlp_only` format is correctly added to both the validation list and the corresponding error message, maintaining consistency with the new quantization format support in `hf_ptq.py`.
382-388: Default dtype change fromfloat16tobfloat16looks good.This is a sensible default for modern GPUs (Ampere and newer) with better numerical stability. The
getattrpattern preserves model-specified dtypes when available.Note: Pre-Ampere GPUs may have limited bfloat16 support, but this should be fine as calibration workloads typically target newer hardware.
examples/llm_ptq/example_utils.py
Outdated
```python
if model_type == "deepseek":
    # Disable MLA quantization for accuracy.
    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
```
Deepseek MLA disable logic is unreachable due to incorrect nesting.
The if model_type == "deepseek" block is nested inside the if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4": condition (line 183). Since model_type cannot simultaneously be "deepseek" and in ["qwen3moe", "qwen3next"], this code will never execute.
🐛 Proposed fix - move to a separate conditional block
```diff
 if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4":
     # Disable the attention projection layers to retain accuracy
     quant_cfg["quant_cfg"]["model*.*attn*in_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*q_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*k_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*v_proj*"] = {"enable": False}
-    if model_type == "deepseek":
-        # Disable MLA quantization for accuracy.
-        quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
+if model_type == "deepseek":
+    # Disable MLA quantization for accuracy.
+    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
+    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}

 return quant_cfg
```

🤖 Prompt for AI Agents
return quant_cfg🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 190 - 194, The deepseek
MLA-disable branch is unreachable because the `if model_type == "deepseek"`
block is nested inside the `if model_type in ["qwen3moe", "qwen3next"] and
qformat == "nvfp4":` condition; pull the deepseek logic out into its own
top-level conditional (separate from the qwen3moe/qwen3next+nvfp4 check) so it
runs when `model_type == "deepseek"`, and keep the existing modifications to
`quant_cfg` that set `quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable":
False}` and `quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}`
intact.
examples/llm_ptq/example_utils.py
Outdated
```python
elif hf_config.quantization_config.get("format", None) == "pack-quantized":
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        ckpt_path,
        device_map="auto",
        trust_remote_code=trust_remote_code,
        torch_dtype=torch_dtype,
    )
```
Potential AttributeError if quantization_config is not present or not a dict.
hf_config.quantization_config.get("format", None) will raise an AttributeError if quantization_config is None or doesn't exist on the config. Consider using getattr with a safe fallback.
🐛 Proposed fix
```diff
-elif hf_config.quantization_config.get("format", None) == "pack-quantized":
+elif getattr(hf_config, "quantization_config", None) and hf_config.quantization_config.get("format", None) == "pack-quantized":
     torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
     model = AutoModelForCausalLM.from_pretrained(
         ckpt_path,
         device_map="auto",
         trust_remote_code=trust_remote_code,
         torch_dtype=torch_dtype,
     )
```

Alternatively, use a more defensive pattern:

```python
elif getattr(getattr(hf_config, "quantization_config", {}), "get", lambda k, d: d)("format", None) == "pack-quantized":
```

Or extract the check:

```python
quant_config = getattr(hf_config, "quantization_config", None) or {}
quant_format = quant_config.get("format") if isinstance(quant_config, dict) else getattr(quant_config, "format", None)
...
elif quant_format == "pack-quantized":
```

🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 354 - 361, The current check
uses hf_config.quantization_config.get("format", None) which can raise
AttributeError if quantization_config is None or not a dict; instead,
defensively read quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.
examples/llm_ptq/hf_ptq.py
Outdated
```python
# Run some samples
torch.cuda.empty_cache()
generated_ids_after_ptq = None
if model_type != "llama4" and not is_nemotron_vl_model:
    # Our fake quantizer may not be fully compatible with torch.compile.
    generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
elif is_nemotron_vl_model and tokenizer is not None:
    generated_ids_after_ptq = run_nemotron_vl_preview(
        full_model,
        tokenizer,
        preview_input_ids,
        args.pyt_ckpt_path,
        "after quantization",
        allow_fallback=False,
    )
else:
    warnings.warn(
        "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
    )
```
Potential NameError when verbose=False.
The variable generated_ids_after_ptq is only defined inside the if args.verbose: block (line 620), but it's referenced at line 661 outside of any conditional. When --no-verbose is passed, this will raise a NameError.
🐛 Proposed fix
Initialize generated_ids_after_ptq before the verbose block:
```diff
+generated_ids_after_ptq = None
 if args.verbose:
     mtq.print_quant_summary(full_model)

     # Run some samples
     torch.cuda.empty_cache()
-    generated_ids_after_ptq = None
     if model_type != "llama4" and not is_nemotron_vl_model:
```

🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 618 - 636, Initialize
generated_ids_after_ptq before the verbose/conditional block so it always exists
regardless of args.verbose; e.g., set generated_ids_after_ptq = None prior to
any if args.verbose / generation logic (the symbol to update is
generated_ids_after_ptq and ensure any later references to it after the verbose
block handle the None case or are guarded), leaving the rest of the generation
branches (full_model.generate, run_nemotron_vl_preview, and the Llama4 warning)
unchanged.
modelopt/torch/quantization/plugins/huggingface.py

```python
def unpack_weight(self):
    from compressed_tensors.quantization import QuantizationStatus

    if self.quantization_status == QuantizationStatus.COMPRESSED:
        self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
    del self.weight_packed
    del self.weight_scale
```
Guard against missing attributes in unpack_weight.
The del self.weight_packed and del self.weight_scale statements are executed unconditionally, but these attributes may not exist if quantization_status is not COMPRESSED or if they were already deleted. This could raise an AttributeError.
🐛 Proposed fix
```diff
 def unpack_weight(self):
     from compressed_tensors.quantization import QuantizationStatus

     if self.quantization_status == QuantizationStatus.COMPRESSED:
         self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
-    del self.weight_packed
-    del self.weight_scale
+    if hasattr(self, "weight_packed"):
+        del self.weight_packed
+    if hasattr(self, "weight_scale"):
+        del self.weight_scale
```

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 605 - 612,
In unpack_weight, avoid unguarded deletions that can raise AttributeError: only
delete self.weight_packed and self.weight_scale if they exist (e.g., check
hasattr(self, "weight_packed") / hasattr(self, "weight_scale")) or perform the
deletes inside the branch where self.quantization_status ==
QuantizationStatus.COMPRESSED after successful decompression; reference the
unpack_weight method and the QuantizationStatus.COMPRESSED check to ensure
deletions occur only when appropriate.
…l/kimik2-thinking Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
66ac7a6 to e8b7fc6 (force-pushed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   74.19%   74.19%
=======================================
  Files         192      192
  Lines       19238    19238
=======================================
  Hits        14273    14273
  Misses       4965     4965
```

☔ View full report in Codecov by Sentry.
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
I have updated that in the PR description.
Did we evaluate accuracy for the checkpoint generated by this flow?
Yes, I ran GPQA and AIME for the V2. It's close to the previous NVFP4 checkpoint measured on lbd-lax (though both models -- the V2 and the published one -- are below the core models' previous measurement). The quantization flow should be correct.
Edwardf0t1 left a comment
LGTM in general. Please update the commands in the PR description for generating both the v1 and v2 checkpoints, follow up on the other reviewers' comments, and resolve conflicts before merging.
qq: how large is the gap compared with @janekl's measurement?
I take this back. Looks like the benchmark IDs are different. I'm regenerating the numbers.
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
8bf37cd to 4200d3c (force-pushed)
What does this PR do?
Type of change: New feature
Overview:
Support KIMI K2 Thinking PTQ from the original int4 checkpoint.
Tested with transformers 4.57.1 and compressed-tensors 0.12.0.
The model weights are dequantized on the fly to save GPU memory (see the sketch below).
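Conceptually, "dequantized on the fly" means the packed int4 weights stay compressed in GPU memory and are only expanded inside each forward call. A simplified, hypothetical sketch of the idea (not the actual `_QuantCompressedLinear` implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnTheFlyDequantLinear(nn.Module):
    """Toy illustration: keep packed weights plus a scale, expand to bf16 only
    for the duration of the matmul, and let the temporary tensor be freed."""

    def __init__(self, weight_packed: torch.Tensor, weight_scale: torch.Tensor):
        super().__init__()
        self.register_buffer("weight_packed", weight_packed)
        self.register_buffer("weight_scale", weight_scale)

    def dequantize(self) -> torch.Tensor:
        # Placeholder dequantization; real int4 unpacking is format-specific.
        return self.weight_packed.to(torch.bfloat16) * self.weight_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.dequantize())
```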
Usage
```bash
scripts/huggingface_example.sh --model <model_path> --quant nvfp4_mlp_only --trust_remote_code
```

where `<model_path>` is the path to the original KIMI K2 Thinking int4 checkpoint.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements