Skip to content

Conversation

@pmattione-nvidia
Copy link
Contributor

@pmattione-nvidia pmattione-nvidia commented Dec 10, 2025

This changes the parquet decode to skip decoding work up to first_row for non-lists. We were already skipping ahead for lists, which was easy to do because there is a lot of list preprocessing that tells us exactly how far to skip ahead.

We don't have that for non-lists with nulls, so we have to do some work. Specifically, if we skip N rows, we don't know how many valids we are skipping so we have to loop through and figure that out. This loop (with the call to skip_validity_and_row_indices_nonlist()) is basically the same as the normal decode processing with all of the extra superfluous code/work stripped out. We then can skip ahead with the dictionary and bool stream decode as before.

This does not immediately improve performance, as the bottlenecks are elsewhere. However, skipping ahead simplifies the rest of the code, removing weird corner cases (see removed if-checks) in the code that were necessary without the skipping. This will make further simplifications and optimizations to the code easier later on.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia self-assigned this Dec 10, 2025
@pmattione-nvidia pmattione-nvidia requested a review from a team as a code owner December 10, 2025 23:28
@pmattione-nvidia pmattione-nvidia added the Performance Performance related issue label Dec 10, 2025
@pmattione-nvidia pmattione-nvidia added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 10, 2025
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 10, 2025
}();

// dst_pos may be negative (values before first_row) for non-lists.
if (dst_pos >= 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dst_pos can no longer be negative (it was negative before because we subtract first_row but we start there now). This and the following section look like a bunch of changes, but it's just white-space from removing the if.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint. Hiding whitespaces def helps here

}();

// dst_pos may be negative (values before first_row) for non-lists.
if (dst_pos >= 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes in this whole section except removing the if and whitespace changes from shifting the code in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say: what the heck are all these changes? Then I realized it was just indentation. :)

* @return Maximum depth valid count after processing
*/
template <int decode_block_size, typename level_t, bool is_nested, int rolling_buf_size>
__device__ int skip_validity_and_row_indices_nonlist(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a gutted version of update_validity_and_row_indices_flat(), where all of the unnecessary code has been removed (e.g. filling validity flags).


// sum null counts. we have to do it this way instead of just incrementing by (value_count -
// valid_count) because valid_count also includes rows that potentially start before our row
// bounds. if we could come up with a way to clean that up, we could remove this and just
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change "cleans that up"


// sum null counts. we have to do it this way instead of just incrementing by (value_count -
// valid_count) because valid_count also includes rows that potentially start before our row
// bounds. if we could come up with a way to clean that up, we could remove this and just
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change "cleans that up"

}();

// lookup input string pointer & length. store length.
bool const in_range = (thread_pos < target_pos) && (dst_pos >= 0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dst_pos can no longer be negative, same as in the other decode functions

if (skipped_leaf_values > 0) {
if (should_process_nulls) {
skip_decode<rolling_buf_size>(def_decoder, skipped_leaf_values, t);
// Skip ahead in the decoding so that we don't repeat work
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the actual skipping. the list skipping is refactored slightly to share common code with non-lists (skip_bools()).


// Non-lists
int const first_row = s->first_row;
if (first_row <= 0) { return; }
Copy link
Contributor Author

@pmattione-nvidia pmattione-nvidia Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the non-list skipping. Note that if we don't process nulls we know exactly how far to skip and no looping is needed.

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass and so far so good. One small question

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments; looks good otherwise. Waiting for all tests to pass and a final pass before approve

@vuule vuule self-requested a review December 12, 2025 03:49
}();

// dst_pos may be negative (values before first_row) for non-lists.
if (dst_pos >= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say: what the heck are all these changes? Then I realized it was just indentation. :)

if (skipped_leaf_values > 0) {
if (should_process_nulls) {
skip_decode<rolling_buf_size>(def_decoder, skipped_leaf_values, t);
// Skip ahead in the decoding so that we don't repeat work
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinion: this kernel has a whole lot of code in one function. Might be worth refactoring this whole skip-stuff block into it's own function if it's not too disruptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that first, but it ends up needing like 12 template arguments and passes like 8 parameters so the function call is about 20 lines long with the formatting and it ends up being grotesque. To make it simpler I tried using auto for some of the type parameters and it wouldn't compile. Using a lambda with & was just way easier and cleaner.

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Don't have any more suggestions but I would still like to see that lambda go away if possible 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants