Support KIMI K2 Thinking int4 checkpoint PTQ #669
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Edwardf0t1 left a comment
Is the generated ckpt identical to @jingyu-ml's previously generated nvfp4 ckpt?
modelopt/torch/quantization/plugins/huggingface.py

```python
    pass


try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear
```
Should we add compressed-tensors as an optional dependency?
@kevalmorabia97 @realAsma what do you think?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed? What benefit would we get from adding it as an optional dependency?
compressed-tensors's main dependencies are torch and transformers, so it should be pretty lightweight to add as a dependency; fine if you want to add it. But if it's not commonly used by customers, perhaps we can skip it.
Can we move this to a separate file, modelopt/torch/quantization/plugins/compressed_tensor.py?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed?
This is a good point. +1
Are we planning to have any unit tests for compressed tensor integration?
Can we move this to a separate file modelopt/torch/quantization/plugins/compressed_tensor.py?

How strongly do you feel about it? Right now I feel this still falls under the HF plugins, since it is part of the HF invocation path.
@cjluo-nv Did we run any deployment and accuracy tests for the ckpt generated with this flow to make sure it's correct? Asking because there's a customer who wants to generate the ckpt themselves. In addition, I heard from @jingyu-ml that we need to modify modeling_deepseek.py to enable our PTQ flow.
With transformers 4.48, we don't need to modify the original model. I have not had a chance to validate the checkpoint yet; will probably continue after the Christmas break. @Edwardf0t1 is this an urgent request?
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Hello @cjluo-nv. Please specify exactly which transformers version you have tested this with. I've just used 4.48.0 and quickly hit some import issues (for
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 Walkthrough

Changes extend quantization support for HuggingFace models with a new QuantCompressedLinear module, adjust model loading for the pack-quantized format, disable deepseek-specific MLA quantization, refine export/cleanup procedures, and establish bfloat16 as the default precision instead of float16.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
5328ae9 to 66ac7a6 (force-pushed)
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
unpack_weight(605-611)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
examples/llm_ptq/hf_ptq.py (1)
examples/llm_ptq/example_utils.py (1)
run_nemotron_vl_preview(49-93)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (8)
modelopt/torch/export/layer_utils.py (1)

346-349: LGTM! The updated `is_quantlinear` logic correctly extends detection to include `QuantCompressedLinear` while maintaining the existing `QuantLinear` detection and `lora` exclusion. The implementation aligns with the new `_QuantCompressedLinear` class added in the quantization plugins.
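As a rough illustration, a name-based check of this kind could look like the sketch below (illustrative only; the actual `is_quantlinear` helper in `layer_utils.py` may be implemented differently):

```python
def is_quantlinear(module) -> bool:
    """Sketch: treat QuantLinear / QuantCompressedLinear wrappers as quantized
    linear layers, while excluding LoRA adapter modules."""
    class_name = type(module).__name__
    looks_quantized = "QuantLinear" in class_name or "QuantCompressedLinear" in class_name
    return looks_quantized and "lora" not in class_name.lower()
```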
modelopt/torch/quantization/plugins/huggingface.py (1)

689-697: LGTM! The registration follows the established pattern used for other optional HuggingFace model types (Llama4, Dbrx, Mixtral, etc.), with proper ImportError handling for the optional `compressed-tensors` dependency.
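The optional-dependency registration pattern being referenced generally looks like the sketch below; the registry dict and the `_QuantCompressedLinear` placeholder are illustrative stand-ins, not the actual modelopt symbols:

```python
# Map from original module classes to their quantized replacements (a stand-in
# for modelopt's real registration mechanism).
QUANT_MODULE_REGISTRY: dict[type, type] = {}

try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear

    class _QuantCompressedLinear:
        """Placeholder for the quantized wrapper around CompressedLinear."""

    QUANT_MODULE_REGISTRY[CompressedLinear] = _QuantCompressedLinear
except ImportError:
    # compressed-tensors is optional; quietly skip registration when absent.
    pass
```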
modelopt/torch/export/quant_utils.py (1)

884-890: LGTM! Adding `"weight_shape"` to the skip_keys set correctly filters out metadata attributes from CompressedLinear modules that shouldn't be included in the exported state dict. This aligns with the new `_QuantCompressedLinear` support.
modelopt/torch/export/unified_export_hf.py (3)

395-396: LGTM with a note on performance. The explicit CUDA cache clearing helps prevent OOM during export of large models. Note that calling `empty_cache()` after every layer can add some overhead for models with many layers, but the memory-safety benefit likely outweighs this for the target use case (KIMI K2, which is a large MoE model).
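As a rough picture of the per-layer cleanup being discussed (the loop and the `process_layer` callback are hypothetical, not the actual export code):

```python
import torch

def export_layer_by_layer(layers, process_layer):
    """Export layers one at a time, releasing cached GPU blocks in between to
    keep peak memory low on very large MoE checkpoints."""
    for layer in layers:
        process_layer(layer)       # hypothetical per-layer export/quantization work
        torch.cuda.empty_cache()   # small per-layer overhead, traded for OOM safety
```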
523-525: LGTM! The weight unpacking step correctly prepares CompressedLinear modules for export by decompressing packed weights before quantization processing. The `hasattr` guard ensures this only applies to modules with the `weight_packed` attribute.
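The guarded unpacking pass can be pictured roughly like this (a sketch based on the comment above; the real logic in `unified_export_hf.py` may be structured differently):

```python
def unpack_compressed_weights(model):
    """Decompress packed int4 weights on any module that still carries a
    `weight_packed` buffer, so the exporter only sees plain dense weights."""
    for module in model.modules():
        if hasattr(module, "weight_packed") and hasattr(module, "unpack_weight"):
            module.unpack_weight()
```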
597-600: LGTM! Removing the `hf_quantizer` attribute before calling `save_pretrained` ensures that HuggingFace's serialization doesn't interfere with the custom quantized state dict export. The `getattr` with default `None` safely handles models without this attribute.
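A sketch of the cleanup step described above (the attribute and method names come from the comment; the surrounding export logic is omitted):

```python
def save_quantized_checkpoint(model, export_dir: str):
    """Drop HF's quantizer hook before serialization so `save_pretrained` does
    not try to re-process the already-exported quantized state dict."""
    if getattr(model, "hf_quantizer", None) is not None:
        del model.hf_quantizer
    model.save_pretrained(export_dir)
```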
examples/llm_ptq/scripts/huggingface_example.sh (1)

56-58: LGTM! The `nvfp4_mlp_only` format is correctly added to both the validation list and the corresponding error message, maintaining consistency with the new quantization format support in `hf_ptq.py`.
382-388: Default dtype change fromfloat16tobfloat16looks good.This is a sensible default for modern GPUs (Ampere and newer) with better numerical stability. The
getattrpattern preserves model-specified dtypes when available.Note: Pre-Ampere GPUs may have limited bfloat16 support, but this should be fine as calibration workloads typically target newer hardware.
examples/llm_ptq/example_utils.py
Outdated
```python
if model_type == "deepseek":
    # Disable MLA quantization for accuracy.
    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
```
Deepseek MLA disable logic is unreachable due to incorrect nesting.
The if model_type == "deepseek" block is nested inside the if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4": condition (line 183). Since model_type cannot simultaneously be "deepseek" and in ["qwen3moe", "qwen3next"], this code will never execute.
🐛 Proposed fix - move to a separate conditional block
```diff
 if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4":
     # Disable the attention projection layers to retain accuracy
     quant_cfg["quant_cfg"]["model*.*attn*in_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*q_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*k_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*v_proj*"] = {"enable": False}
-    if model_type == "deepseek":
-        # Disable MLA quantization for accuracy.
-        quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
+if model_type == "deepseek":
+    # Disable MLA quantization for accuracy.
+    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
+    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}

 return quant_cfg
```

🤖 Prompt for AI Agents
return quant_cfg🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 190 - 194, The deepseek
MLA-disable branch is unreachable because the `if model_type == "deepseek"`
block is nested inside the `if model_type in ["qwen3moe", "qwen3next"] and
qformat == "nvfp4":` condition; pull the deepseek logic out into its own
top-level conditional (separate from the qwen3moe/qwen3next+nvfp4 check) so it
runs when `model_type == "deepseek"`, and keep the existing modifications to
`quant_cfg` that set `quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable":
False}` and `quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}`
intact.
examples/llm_ptq/example_utils.py
Outdated
```python
elif hf_config.quantization_config.get("format", None) == "pack-quantized":
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        ckpt_path,
        device_map="auto",
        trust_remote_code=trust_remote_code,
        torch_dtype=torch_dtype,
    )
```
Potential AttributeError if quantization_config is not present or not a dict.
hf_config.quantization_config.get("format", None) will raise an AttributeError if quantization_config is None or doesn't exist on the config. Consider using getattr with a safe fallback.
🐛 Proposed fix
```diff
-elif hf_config.quantization_config.get("format", None) == "pack-quantized":
+elif getattr(hf_config, "quantization_config", None) and hf_config.quantization_config.get("format", None) == "pack-quantized":
     torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
     model = AutoModelForCausalLM.from_pretrained(
         ckpt_path,
         device_map="auto",
         trust_remote_code=trust_remote_code,
         torch_dtype=torch_dtype,
     )
```

Alternatively, use a more defensive pattern:

```python
elif getattr(getattr(hf_config, "quantization_config", {}), "get", lambda k, d: d)("format", None) == "pack-quantized":
```

Or extract the check:

```python
quant_config = getattr(hf_config, "quantization_config", None) or {}
quant_format = quant_config.get("format") if isinstance(quant_config, dict) else getattr(quant_config, "format", None)
...
elif quant_format == "pack-quantized":
```

🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 354 - 361, The current check
uses hf_config.quantization_config.get("format", None) which can raise
AttributeError if quantization_config is None or not a dict; instead,
defensively read quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.
examples/llm_ptq/hf_ptq.py
Outdated
```python
# Run some samples
torch.cuda.empty_cache()
generated_ids_after_ptq = None
if model_type != "llama4" and not is_nemotron_vl_model:
    # Our fake quantizer may not be fully compatible with torch.compile.
    generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
elif is_nemotron_vl_model and tokenizer is not None:
    generated_ids_after_ptq = run_nemotron_vl_preview(
        full_model,
        tokenizer,
        preview_input_ids,
        args.pyt_ckpt_path,
        "after quantization",
        allow_fallback=False,
    )
else:
    warnings.warn(
        "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
    )
```
Potential NameError when verbose=False.
The variable generated_ids_after_ptq is only defined inside the if args.verbose: block (line 620), but it's referenced at line 661 outside of any conditional. When --no-verbose is passed, this will raise a NameError.
🐛 Proposed fix
Initialize generated_ids_after_ptq before the verbose block:
```diff
+generated_ids_after_ptq = None
 if args.verbose:
     mtq.print_quant_summary(full_model)

     # Run some samples
     torch.cuda.empty_cache()
-    generated_ids_after_ptq = None
     if model_type != "llama4" and not is_nemotron_vl_model:
```

🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 618 - 636, Initialize
generated_ids_after_ptq before the verbose/conditional block so it always exists
regardless of args.verbose; e.g., set generated_ids_after_ptq = None prior to
any if args.verbose / generation logic (the symbol to update is
generated_ids_after_ptq and ensure any later references to it after the verbose
block handle the None case or are guarded), leaving the rest of the generation
branches (full_model.generate, run_nemotron_vl_preview, and the Llama4 warning)
unchanged.
modelopt/torch/quantization/plugins/huggingface.py

```python
def unpack_weight(self):
    from compressed_tensors.quantization import QuantizationStatus

    if self.quantization_status == QuantizationStatus.COMPRESSED:
        self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
    del self.weight_packed
    del self.weight_scale
```
Guard against missing attributes in unpack_weight.
The del self.weight_packed and del self.weight_scale statements are executed unconditionally, but these attributes may not exist if quantization_status is not COMPRESSED or if they were already deleted. This could raise an AttributeError.
🐛 Proposed fix
```diff
 def unpack_weight(self):
     from compressed_tensors.quantization import QuantizationStatus

     if self.quantization_status == QuantizationStatus.COMPRESSED:
         self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
-    del self.weight_packed
-    del self.weight_scale
+    if hasattr(self, "weight_packed"):
+        del self.weight_packed
+    if hasattr(self, "weight_scale"):
+        del self.weight_scale
```

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 605 - 612,
In unpack_weight, avoid unguarded deletions that can raise AttributeError: only
delete self.weight_packed and self.weight_scale if they exist (e.g., check
hasattr(self, "weight_packed") / hasattr(self, "weight_scale")) or perform the
deletes inside the branch where self.quantization_status ==
QuantizationStatus.COMPRESSED after successful decompression; reference the
unpack_weight method and the QuantizationStatus.COMPRESSED check to ensure
deletions occur only when appropriate.
…l/kimik2-thinking Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
66ac7a6 to e8b7fc6 (force-pushed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   74.19%   74.19%
=======================================
  Files         192      192
  Lines       19238    19238
=======================================
  Hits        14273    14273
  Misses       4965     4965
```

☔ View full report in Codecov by Sentry.
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
I have updated that in the PR description.
Did we evaluate accuracy for the checkpoint generated by this flow?
Yes, I ran GPQA and AIME for the V2. It's close to the previous NVFP4 checkpoint measured on lbd-lax (though both models -- the V2 and the published one -- are below the core models' previous measurement). The quantization flow should be correct.
Edwardf0t1 left a comment
LGTM in general. Please update the commands in the PR description for generating both the v1 and v2 checkpoints, follow up on the other reviewers' comments, and resolve conflicts before merging.
qq: how large is the gap compared with @janekl's measurement?
I take this back. Looks like the benchmark IDs are different. I'm regenerating the numbers.
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
8bf37cd to 4200d3c (force-pushed)
What does this PR do?
Type of change: New feature
Overview:
Support KIMI K2 Thinking PTQ from the original int4 checkpoint.
Tested with transformers 4.57.1 and compressed-tensors 0.12.0.
The model weights are dequantized on the fly to save GPU memory (see the sketch below).
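Conceptually, "dequantized on the fly" means the packed int4 weights stay compressed in GPU memory and are only expanded inside each forward call. A simplified, hypothetical sketch of the idea (not the actual `_QuantCompressedLinear` implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnTheFlyDequantLinear(nn.Module):
    """Toy illustration: keep packed weights plus a scale, expand to bf16 only
    for the duration of the matmul, and let the temporary tensor be freed."""

    def __init__(self, weight_packed: torch.Tensor, weight_scale: torch.Tensor):
        super().__init__()
        self.register_buffer("weight_packed", weight_packed)
        self.register_buffer("weight_scale", weight_scale)

    def dequantize(self) -> torch.Tensor:
        # Placeholder dequantization; real int4 unpacking is format-specific.
        return self.weight_packed.to(torch.bfloat16) * self.weight_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.dequantize())
```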
Usage
```bash
scripts/huggingface_example.sh --model <model_path> --quant nvfp4_mlp_only --trust_remote_code
```

where `<model_path>` is the path to the original KIMI K2 Thinking int4 checkpoint.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements