Skip to content

Conversation

@KKould
Copy link
Member

@KKould KKould commented Dec 9, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Following the pull request (#19041), the data structure used by Bitmap was changed from RoaringBitmap to HybridBitmap. Therefore, the serialization method changed. Even though deserialization is compatible with the old data, there are still instances where comparisons are made directly using bytes during group by or join operations. This can lead to inconsistent matching results for semantically identical data when comparing old and new bitmap data.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@KKould KKould self-assigned this Dec 9, 2025
@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Dec 9, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a critical bug where bitmap comparisons in join and group by operations could produce incorrect results when comparing data serialized with the old RoaringBitmap format against data using the new HybridBitmap format. The fix ensures all bitmap data is normalized to the hybrid encoding format before comparison operations.

Key changes:

  • Introduces a normalize_bitmap_column utility function that detects legacy-encoded bitmaps and converts them to the current hybrid format
  • Updates join and group by operations to normalize bitmap columns before performing byte-level comparisons
  • Adds comprehensive unit tests verifying that legacy and hybrid-encoded bitmaps with identical data produce identical hashes and comparison results

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/query/expression/src/utils/mod.rs Adds new bitmap utility module
src/query/expression/src/utils/bitmap.rs Implements bitmap normalization logic to convert legacy RoaringBitmap format to HybridBitmap format
src/query/expression/src/aggregate/payload_row.rs Updates bitmap serialization and matching logic to normalize bitmaps before comparison; adds test
src/query/expression/src/aggregate/group_hash.rs Updates bitmap hashing logic to normalize bitmaps before computing hash values; adds test
src/query/expression/Cargo.toml Adds roaring dependency for test cases
Cargo.lock Updates lock file with roaring dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@KKould KKould marked this pull request as ready for review December 9, 2025 11:01
@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@KKould
Copy link
Member Author

KKould commented Dec 9, 2025

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@KKould KKould force-pushed the fix/bitmap_bytes_eq branch from 817126e to fef969f Compare December 10, 2025 09:47
@forsaken628
Copy link
Collaborator

forsaken628 commented Dec 10, 2025

I have a feeling that it's not overly complicated here? Compatibility issues are something that will naturally diminish over time. I think it only needs to be dealt with where the old data is ingested, not all places.

It is recommended to refer to those locations where DataBlock::consume_convert_to_full appears.

@KKould
Copy link
Member Author

KKould commented Dec 11, 2025

Before commit adc9695, it was implemented as you described. However, following @sundy-li suggestion, we attempted to pre-serialize the Bitmap in this PR to address a series of previous issues. Please refer to the updated PR description for details.

@KKould KKould force-pushed the fix/bitmap_bytes_eq branch 2 times, most recently from b24e07a to 94a7c19 Compare December 11, 2025 03:58
@BohuTANG
Copy link
Member

Before commit adc9695, it was implemented as you described. However, following @sundy-li suggestion, we attempted to pre-serialize the Bitmap in this PR to address a series of previous issues. Please refer to the updated PR description for details.

@codex address this feedback

@chatgpt-codex-connector
Copy link

Codex couldn't complete this request. Try again later.

@KKould KKould force-pushed the fix/bitmap_bytes_eq branch from 4eb2698 to 426629e Compare December 12, 2025 07:32
@KKould KKould force-pushed the fix/bitmap_bytes_eq branch 2 times, most recently from dd789cf to e19a667 Compare December 13, 2025 12:02
@KKould
Copy link
Member Author

KKould commented Dec 15, 2025

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@KKould KKould force-pushed the fix/bitmap_bytes_eq branch 2 times, most recently from 60ad320 to ecf9907 Compare December 21, 2025 22:25
@KKould KKould force-pushed the fix/bitmap_bytes_eq branch from ecf9907 to b654f5a Compare December 21, 2025 22:30
@forsaken628
Copy link
Collaborator

@KKould KKould requested a review from forsaken628 December 22, 2025 06:56
@KKould
Copy link
Member Author

KKould commented Dec 22, 2025

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Column::Bitmap(v) | Column::Binary(v) | Column::Variant(v) | Column::Geometry(v) => {
self.match_validity_with(counts, validity, |row, row_ptr| unsafe {
let value = BinaryType::index_column_unchecked(v, row.to_usize());
row_ptr.is_bytes_eq(col_offset, value)

P1 Badge Row comparison mismatches normalized and legacy bitmap bytes

Bitmap columns are normalized to the hybrid encoding when rows are serialized, but comparison still reads raw bytes from the incoming column and compares them byte-for-byte to the stored normalized bytes. When the input block contains legacy bitmap serialization, the normalized row data will no longer match the unnormalized column value, causing identical keys to be treated as distinct during row matching (e.g., duplicate groups or join mismatches).

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@KKould KKould merged commit fac7a9d into databendlabs:main Dec 22, 2025
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix this PR patches a bug in codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants