# feat: [iceberg] CometExecRDD supports per-partition plan data, Iceberg native scan with DPP #3349
**Codecov Report**

❌ Patch coverage details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #3349      +/-   ##
============================================
+ Coverage    56.12%   60.17%   +4.05%
- Complexity     976     1491     +515
============================================
  Files          119      174      +55
  Lines        11743    16348    +4605
  Branches      2251     2713     +462
============================================
+ Hits          6591     9838    +3247
- Misses        4012     5129    +1117
- Partials      1140     1381     +241
```
…oned columns) and run a representative test.
```rust
// Include generated modules from .proto files.
#[allow(missing_docs)]
#[allow(clippy::large_enum_variant)]
```
We should probably have a follow-up issue to revisit the choice to ignore this warning.
Previously:

1. `findAllIcebergSplitData()` collected `perPartitionByLocation` (all partitions' data).
2. This map was captured in the `createCometExecIter` closure.
3. `ZippedPartitionsRDD` serialized that closure to every task.
4. Each task received ALL partitions' data (925 bytes to both tasks).

Instead we now use `CometIcebergSplitRDD`, which puts per-partition data in `Partition` objects.
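For illustration, a minimal sketch of that pattern (`PlanDataPartition` and `PerPartitionDataRDD` are hypothetical names, not the PR's actual `CometIcebergSplitRDD`):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each Partition object carries only its own slice of the plan data, so the
// task closure stays small and nothing partition-specific is captured in it.
class PlanDataPartition(override val index: Int, val planData: Array[Byte])
  extends Partition

class PerPartitionDataRDD(sc: SparkContext, perPartitionData: Array[Array[Byte]])
  extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](perPartitionData.length) { i =>
      new PlanDataPartition(i, perPartitionData(i))
    }

  // Spark ships each Partition object only to the task that computes it, so a
  // task sees its own bytes instead of the whole perPartitionByLocation map.
  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] =
    Iterator.single(split.asInstanceOf[PlanDataPartition].planData)
}
```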
Even though CI is green, gonna mark this as draft since I might still keep refactoring a bit to clean things up and don't want it merged early.
…hat every partition doesn't come over.
… columns), fixes TestRuntimeFiltering Iceberg Java tests with column renames. CometIcebergSplitRDD registers subqueries so native code can look them up, fixes TestViews Iceberg Java tests with rewritten filter.
… assertion at index lookup, and defensive fallback if future Spark behavior changes.
# Conflicts:
#	.github/workflows/iceberg_spark_test.yml
Continues efforts from #3295, #3297, and #3301, building on #3295's diff. I asked Claude to summarize the diff and draft the PR description for me:
## Which issue does this PR close?

Closes #.
## Rationale for this change
This PR continues the work from #3295, addressing two problems with Iceberg native scan:
1. **Serialization overhead**: All Iceberg `FileScanTask` data was serialized into the protobuf plan at planning time and distributed to every executor. For tables with many files, this creates significant overhead, since every task receives the full plan containing all partitions' tasks.
2. **No DPP support**: Dynamic Partition Pruning couldn't work because partition data was serialized at planning time, before DPP subqueries execute. Now that we're deferring some parts of plan generation, that opens the door for DPP.
## What changes are included in this PR?
### Partition data distribution (solves the plan bloat problem)
- `IcebergScan` is serialized with only `metadata_location` (used for matching)
- `serializePartitions()` runs at execution time, after DPP resolves, producing:
  - `commonData`: shared across partitions (catalog properties, schema, pools)
  - `perPartitionData`: an array of serialized `FileScanTask` data, one entry per partition
- `CometExecPartition` carries only its own partition's data; `PlanDataInjector` injects it into the operator tree at execution time (see the sketch after this list)
- `commonData` parsing avoids repeated allocations for `PartitionData` and `PartitionValue`
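To illustrate the injection step, here is a minimal sketch under assumed plan-node shapes (`PlanNode`, `IcebergScanNode`, and `OpNode` are hypothetical stand-ins for the PR's protobuf types):

```scala
// Hypothetical plan-node shapes; the actual protobuf types differ.
sealed trait PlanNode { def children: Seq[PlanNode] }
case class IcebergScanNode(
    metadataLocation: String,
    taskData: Option[Array[Byte]],
    children: Seq[PlanNode] = Nil) extends PlanNode
case class OpNode(name: String, children: Seq[PlanNode]) extends PlanNode

object PlanDataInjectorSketch {
  // Walk the operator tree and attach this partition's serialized task data to
  // the scan whose metadata_location matches, leaving other nodes untouched.
  def inject(node: PlanNode, location: String, data: Array[Byte]): PlanNode =
    node match {
      case scan: IcebergScanNode if scan.metadataLocation == location =>
        scan.copy(taskData = Some(data))
      case op: OpNode =>
        op.copy(children = op.children.map(inject(_, location, data)))
      case other => other
    }
}
```

At execution time each task would run something like `inject(plan, metadataLocation, partition.planData)` before handing the tree to the native side.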
### Unified RDD (replaces ZippedPartitionsRDD)

- `ZippedPartitionsRDD` is removed; `CometExecRDD` now handles all cases (a sketch follows this list):
  - with input RDDs, it zips their `inputPartitions` (equivalent to `ZippedPartitionsBaseRDD`)
  - with no inputs, it uses `numPartitions` to create partitions (the standalone scan case)
- `getPartitions` creates a `CometExecPartition` with an `inputPartitions` array taken from each input RDD at that index
- `compute` iterates each input RDD with its corresponding partition (same as `ZippedPartitionsRDD.compute`)
- `getDependencies` returns a `OneToOneDependency` for each input (same dependency semantics)
- `getPreferredLocations` uses intersection-then-union logic (duplicated from `ZippedPartitionsBaseRDD`)
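A minimal sketch of such a unified RDD, assuming the behaviors listed above (`UnifiedExecRDD` and `ExecPartition` are illustrative names, not the PR's exact `CometExecRDD`):

```scala
import org.apache.spark.{Dependency, OneToOneDependency, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical partition: its index plus the matching partition of each input RDD.
class ExecPartition(override val index: Int, val inputPartitions: Array[Partition])
  extends Partition

class UnifiedExecRDD(
    sc: SparkContext,
    inputs: Seq[RDD[ColumnarBatch]],
    numPartitions: Int)
  extends RDD[ColumnarBatch](sc, Nil) {

  // With inputs, zip them (all inputs assumed to have equal partition counts,
  // as ZippedPartitionsBaseRDD requires); with none, create numPartitions
  // partitions for the standalone scan case.
  override protected def getPartitions: Array[Partition] = {
    val n = if (inputs.nonEmpty) inputs.head.partitions.length else numPartitions
    Array.tabulate[Partition](n) { i =>
      new ExecPartition(i, inputs.map(_.partitions(i)).toArray)
    }
  }

  // Same dependency semantics as ZippedPartitionsRDD: one-to-one per input.
  override protected def getDependencies: Seq[Dependency[_]] =
    inputs.map(new OneToOneDependency(_))

  // Intersection-then-union: prefer hosts wanted by every input, otherwise
  // fall back to the union of all inputs' preferences.
  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    val part = split.asInstanceOf[ExecPartition]
    val prefs = inputs.zip(part.inputPartitions).map { case (rdd, p) =>
      rdd.preferredLocations(p)
    }
    val common = prefs.reduceOption(_ intersect _).getOrElse(Nil)
    if (common.nonEmpty) common else prefs.flatten.distinct
  }

  // Iterate each input with its corresponding partition; a real implementation
  // would feed these iterators to the native plan rather than concatenate them.
  override def compute(split: Partition, context: TaskContext): Iterator[ColumnarBatch] = {
    val part = split.asInstanceOf[ExecPartition]
    inputs.iterator.zip(part.inputPartitions.iterator).flatMap { case (rdd, p) =>
      rdd.iterator(p, context)
    }
  }
}
```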
### Dynamic Partition Pruning support

- `CometIcebergNativeScanExec.doPrepare()` triggers DPP subquery preparation
- `serializedPartitionData` is lazy: it waits for DPP values, then serializes only the filtered partitions (see the sketch after this list)
- handles `SubqueryAdaptiveBroadcastExec` via reflection to set `InSubqueryExec.result`
- handles `SubqueryAdaptiveBroadcastExec.index` vs. `indices` (SPARK-46946)
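A self-contained sketch of the laziness described above; every name here is hypothetical, not the PR's API. The point is only the ordering: nothing is serialized until first access, which happens after the DPP values resolve.

```scala
// Hypothetical stand-in for an Iceberg FileScanTask plus its serialized form.
final case class FileTask(partitionValue: Int, payload: Array[Byte])

class NativeScanSketch(allTasks: Seq[FileTask], dppValues: () => Option[Set[Int]]) {
  // Evaluated on first use, i.e. at execution time, after doPrepare()-style
  // subquery preparation has produced the DPP values.
  lazy val serializedPartitionData: Array[Array[Byte]] = {
    val tasks = dppValues() match {
      case Some(keep) => allTasks.filter(t => keep.contains(t.partitionValue))
      case None       => allTasks // no DPP filter: keep every partition
    }
    tasks.map(_.payload).toArray
  }
}
```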
### CometNativeExec iterator creation

- iterator creation moves into `CometExecRDD.compute()`
- `findAllPlanData` traverses the plan tree to collect planning data, stopping at stage boundaries (see the sketch after this list)
- `collectSubqueries` gathers `ScalarSubquery` expressions for registration with `CometScalarSubquery`
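An illustrative sketch of a stage-bounded traversal in that spirit (not the PR's exact helper; the choice of `Exchange`/`QueryStageExec` as boundary nodes is an assumption):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.QueryStageExec
import org.apache.spark.sql.execution.exchange.Exchange

object PlanTraversalSketch {
  // Collect values from the plan tree, but do not descend past stage
  // boundaries (exchanges and query stages), whose data belongs to other stages.
  def findAllPlanData[T](plan: SparkPlan)(collect: PartialFunction[SparkPlan, T]): Seq[T] = {
    val here = collect.lift(plan).toSeq
    plan match {
      case _: Exchange | _: QueryStageExec => here // stage boundary: stop here
      case _ => here ++ plan.children.flatMap(child => findAllPlanData(child)(collect))
    }
  }
}
```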
## How are these changes tested?

- `CometIcebergNativeSuite` with DPP tests:
  - `runtime filtering - join with dynamic partition pruning`: verifies DPP prunes partitions (3 → 1)
  - `runtime filtering - multiple DPP filters on two partition columns`: tests multi-column DPP
- Existing `TestRuntimeFiltering` tests
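For flavor, a hedged sketch of what such a DPP assertion might look like; this is not the suite's actual helper code. It assumes a `SparkSession` named `spark` with Comet enabled and illustrative Iceberg tables `db.fact` (partitioned on `part_col`) and `db.dim`.

```scala
import org.apache.spark.sql.catalyst.expressions.DynamicPruningExpression

val df = spark.sql(
  """SELECT f.id FROM db.fact f
    |JOIN db.dim d ON f.part_col = d.part_col
    |WHERE d.flag = true""".stripMargin)
df.collect() // force execution so the DPP subqueries actually run

// DPP shows up as a DynamicPruningExpression in the executed plan; the suite's
// real check (3 -> 1 partitions scanned) is stronger than this presence test.
val dppNodes = df.queryExecution.executedPlan.collect {
  case p if p.expressions.exists(
      _.find(_.isInstanceOf[DynamicPruningExpression]).isDefined) => p
}
assert(dppNodes.nonEmpty, "expected dynamic partition pruning in the plan")
```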