-
Notifications
You must be signed in to change notification settings - Fork 843
fix: where comparing old and new bitmap versions occurred when directly comparing bytes in join and group by operations. #19082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request fixes a critical bug where bitmap comparisons in join and group by operations could produce incorrect results when comparing data serialized with the old RoaringBitmap format against data using the new HybridBitmap format. The fix ensures all bitmap data is normalized to the hybrid encoding format before comparison operations.
Key changes:
- Introduces a
normalize_bitmap_columnutility function that detects legacy-encoded bitmaps and converts them to the current hybrid format - Updates join and group by operations to normalize bitmap columns before performing byte-level comparisons
- Adds comprehensive unit tests verifying that legacy and hybrid-encoded bitmaps with identical data produce identical hashes and comparison results
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/query/expression/src/utils/mod.rs | Adds new bitmap utility module |
| src/query/expression/src/utils/bitmap.rs | Implements bitmap normalization logic to convert legacy RoaringBitmap format to HybridBitmap format |
| src/query/expression/src/aggregate/payload_row.rs | Updates bitmap serialization and matching logic to normalize bitmaps before comparison; adds test |
| src/query/expression/src/aggregate/group_hash.rs | Updates bitmap hashing logic to normalize bitmaps before computing hash values; adds test |
| src/query/expression/Cargo.toml | Adds roaring dependency for test cases |
| Cargo.lock | Updates lock file with roaring dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
817126e to
fef969f
Compare
|
I have a feeling that it's not overly complicated here? Compatibility issues are something that will naturally diminish over time. I think it only needs to be dealt with where the old data is ingested, not all places. It is recommended to refer to those locations where |
b24e07a to
94a7c19
Compare
|
Codex couldn't complete this request. Try again later. |
4eb2698 to
426629e
Compare
dd789cf to
e19a667
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ly comparing bytes in join and group by operations.
Co-authored-by: Copilot <[email protected]>
60ad320 to
ecf9907
Compare
ecf9907 to
b654f5a
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
databend/src/query/expression/src/aggregate/payload_row.rs
Lines 446 to 449 in f46faf0
| Column::Bitmap(v) | Column::Binary(v) | Column::Variant(v) | Column::Geometry(v) => { | |
| self.match_validity_with(counts, validity, |row, row_ptr| unsafe { | |
| let value = BinaryType::index_column_unchecked(v, row.to_usize()); | |
| row_ptr.is_bytes_eq(col_offset, value) |
Bitmap columns are normalized to the hybrid encoding when rows are serialized, but comparison still reads raw bytes from the incoming column and compares them byte-for-byte to the stored normalized bytes. When the input block contains legacy bitmap serialization, the normalized row data will no longer match the unnormalized column value, causing identical keys to be treated as distinct during row matching (e.g., duplicate groups or join mismatches).
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
Following the pull request (#19041), the data structure used by Bitmap was changed from RoaringBitmap to HybridBitmap. Therefore, the serialization method changed. Even though deserialization is compatible with the old data, there are still instances where comparisons are made directly using bytes during group by or join operations. This can lead to inconsistent matching results for semantically identical data when comparing old and new bitmap data.
Tests
Type of change
This change is