-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Deterministic data-only CRC (2/2) - use data-only CRC for realtime committing segments during replace #17380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17380 +/- ##
============================================
+ Coverage 63.22% 63.26% +0.04%
Complexity 1475 1475
============================================
Files 3147 3161 +14
Lines 187575 188440 +865
Branches 28712 28837 +125
============================================
+ Hits 118590 119220 +630
- Misses 59794 59977 +183
- Partials 9191 9243 +52
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@Jackie-Jiang PTAL as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces intelligent segment replacement logic by adding a useDataCrcIfAvailable parameter to distinguish between data changes and index-only changes. When enabled, the system avoids unnecessary segment downloads if only indexes have changed while the underlying data remains the same.
Key changes:
- Modified
replaceSegmentIfCrcMismatchmethod to accept auseDataCrcIfAvailableboolean parameter - Updated
hasSameCRCmethod to implement fallback logic: check full CRC first, then optionally check data-only CRC - Applied data CRC logic selectively: enabled for CONSUMING→ONLINE transitions, offline segment additions, and server startup; disabled for explicit reload/replace operations
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| BaseTableDataManager.java | Added useDataCrcIfAvailable parameter to replaceSegmentIfCrcMismatch and hasSameCRC methods; implemented fallback CRC comparison logic |
| RealtimeTableDataManager.java | Enabled data CRC usage (true) for CONSUMING→ONLINE segment transitions |
| OfflineTableDataManager.java | Enabled data CRC usage (true) for offline segment additions |
| BaseTableDataManagerTest.java | Added test coverage for new CRC comparison scenarios including both CRCs matching, data CRC matching with index changes, and data CRC unavailability |
| BaseTableDataManagerEnqueueSegmentToReplaceTest.java | Updated mock verifications to include the new useDataCrcIfAvailable parameter |
| return zkMetadata.getDataCrc() >= 0 | ||
| && Long.parseLong(localMetadata.getDataCrc()) >= 0 | ||
| && zkMetadata.getDataCrc() == Long.parseLong(localMetadata.getDataCrc()); |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code may throw NumberFormatException if localMetadata.getDataCrc() returns a non-numeric string. The check should validate that the string is parseable before attempting to parse it, or the method should use a try-catch block to handle potential parsing failures gracefully.
| return zkMetadata.getDataCrc() >= 0 | |
| && Long.parseLong(localMetadata.getDataCrc()) >= 0 | |
| && zkMetadata.getDataCrc() == Long.parseLong(localMetadata.getDataCrc()); | |
| if (zkMetadata.getDataCrc() < 0) { | |
| return false; | |
| } | |
| String localDataCrc = localMetadata.getDataCrc(); | |
| if (!StringUtils.isNumeric(localDataCrc)) { | |
| LOGGER.warn("Local segment metadata data CRC is not numeric: {}", localDataCrc); | |
| return false; | |
| } | |
| long localDataCrcValue; | |
| try { | |
| localDataCrcValue = Long.parseLong(localDataCrc); | |
| } catch (NumberFormatException e) { | |
| // Guard against values that are numeric but outside the range of a long | |
| LOGGER.warn("Failed to parse local segment metadata data CRC: {}", localDataCrc, e); | |
| return false; | |
| } | |
| if (localDataCrcValue < 0) { | |
| return false; | |
| } | |
| return zkMetadata.getDataCrc() == localDataCrcValue; |
| _logger.warn("Segment: {} has CRC changed from: {} to: {} and data CRC prev value: {} new value: {}", | ||
| segmentName, | ||
| segmentMetadata.getCrc(), | ||
| zkMetadata.getCrc(), | ||
| segmentMetadata.getDataCrc(), | ||
| zkMetadata.getDataCrc()); |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log message uses positional parameters but the order is inconsistent with the description. 'from: {} to: {}' should show old then new, but the code logs local metadata first, then ZK metadata. Consider reordering to 'from: {} (local) to: {} (ZK)' for clarity, or swap the order of getCrc() calls to match the 'from→to' semantic.
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss when should we use data CRC instead of index CRC, and what is the side effect.
When using data CRC, index only change happening in the deep store (i.e. new index added) won't be honored. This could prevent users from creating the index from minion and reduce the index creation on server.
Given we want to solve the problem of real-time committed segment potentially having different CRC, I feel a better way to address this is to add a flag in ZK metadata to indicate that we can check only the data CRC. This flag only exists in committed segment, but not segment pushed from other ingestion flow
Thanks @Jackie-Jiang , we can do that. For my understanding, in the current code, I'm only using Data CRC in For the other flows of reload segment, replace segments (used by minions), data CRC is not used. So, in a way, is the code already handling this point ? Or are there other flows that might be accidentally included in this ? |
These helix state transitions apply to a lot of scenarios, not only for the committed segments. E.g. when server starts, all segments are loaded through these 2 state transitions, which will make server ignore changes applied by minions. |
@Jackie-Jiang Makes sense. Updated the logic to persist an optional ZK flag for realtime committing segments that will be used to tell the replicas to use data CRC for replace. PTAL. |
| public static final String TOTAL_DOCS = "segment.total.docs"; | ||
| public static final String CRC = "segment.crc"; | ||
| public static final String DATA_CRC = "segment.data.crc"; | ||
| public static final String USE_DATA_CRC_FOR_REPLACE = "segment.useDataCRCForReplace"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we keep the name consistent with other segment configs? segment.replacement.strategy.data.crc can be one suggestion.
Motivation
Resolves #17262
Summary
This PR is a follow up of #17264 that added a computation of data-only CRC for segments and uploaded it to ZK. It introduces a new property in ZK metadata for segments
useDataCrcForReplaceonly for realtime committing segments. If this flag is present, the replica servers when downloading and replacing the realtime segments will end up using the data CRC (if also available in ZK) to carry out in place replace.Details
When This Optimization Applies
Only for realtime segments during CONSUMING → ONLINE transition
Only when the flag is set by the controller during segment commit
Only on replica servers that have built the segment locally
Fallback behavior: If data CRCs also don't match, segment is still downloaded (safe)
When Standard Behavior Still Applies
Offline table segments (flag never set)
Uploaded LLC segments (flag never set)
Manual segment replacement operations
When instance.check.crc.on.segment.load is disabled
Testing
UTs for replaceSegmentIFCRCMismatch() added.
Ran quick start for realtime tables,
Data CRC and useDataCRC flag present for realtime committed segments,

Flag not present for IN_PROGRESS consuming segments,

Data CRC present but Flag NOT present for offline table segments,
