perf: Iceberg serde ~50% faster serialization [iceberg] #3298
base: main
Conversation
…ization

Cache reflection and computation results to avoid redundant work:
1. ReflectionCache: cache Class.forName() and getMethod() calls once per convert() instead of per-task (30,000+ times)
2. Partition spec deduplication by object identity: only call toJson() for new unique specs, not for every task
3. Partition type deduplication by spec identity: same spec = same partition type, so skip JSON building for duplicate specs
4. Field ID mapping cache: cache buildFieldIdMapping() results by schema identity to avoid repeated reflection per-column

Benchmark results (30,000 tasks):
- Original: 34,425 ms per 100 iterations
- After caching: 16,618 ms per 100 iterations
- Improvement: 52% faster

Co-Authored-By: Claude Opus 4.5 <[email protected]>
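Roughly, the caching pattern described in this commit message looks like the sketch below. This is an illustrative Scala sketch, not the actual Comet code: the object and field names are hypothetical, and only a few of the cached Iceberg classes and methods are shown.

```scala
import java.lang.reflect.Method

// Illustrative sketch only; names and fields are hypothetical, not the Comet code.
object IcebergSerdeCacheSketch {
  case class ReflectionCache(
      fileMethod: Method,   // ContentScanTask.file()
      specMethod: Method,   // ContentScanTask.spec()
      toJsonMethod: Method) // PartitionSpecParser.toJson(PartitionSpec)

  def createReflectionCache(): ReflectionCache = {
    // Resolved once per convert(), instead of once per task.
    val taskClass = Class.forName("org.apache.iceberg.FileScanTask")
    val specClass = Class.forName("org.apache.iceberg.PartitionSpec")
    val parserClass = Class.forName("org.apache.iceberg.PartitionSpecParser")
    ReflectionCache(
      fileMethod = taskClass.getMethod("file"),
      specMethod = taskClass.getMethod("spec"),
      toJsonMethod = parserClass.getMethod("toJson", specClass))
  }

  def convert(tasks: Seq[AnyRef]): Unit = {
    val cache = createReflectionCache() // one set of lookups, reused for all ~30,000 tasks
    tasks.foreach { task =>
      val file = cache.fileMethod.invoke(task)
      val spec = cache.specMethod.invoke(task)
      // ... build the serialized task from `file`, `spec`, etc. ...
    }
  }
}
```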
61d79ec to bf83e76
Don't we have an IcebergReflection helper? It seems like we should try to encapsulate this logic there.
Codecov Report
❌ Patch coverage is …

Additional details and impacted files:

@@ Coverage Diff @@
## main #3298 +/- ##
============================================
+ Coverage 56.12% 60.09% +3.96%
- Complexity 976 1473 +497
============================================
Files 119 175 +56
Lines 11743 16246 +4503
Branches 2251 2688 +437
============================================
+ Hits 6591 9763 +3172
- Misses 4012 5131 +1119
- Partials 1140 1352 +212

☔ View full report in Codecov by Sentry.
Move the ReflectionCache case class and createReflectionCache() method from CometIcebergNativeScan to the IcebergReflection helper class per code review feedback. This encapsulates all Iceberg reflection caching logic in the shared reflection utilities.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Thanks. I pushed another commit to do this.
if (fieldTypeStr == IcebergReflection.TypeNames.UNKNOWN) {
  None
} else {
  val fieldIdMethod = field.getClass.getMethod("fieldId")
It looks like some of these still aren't being cached?
Thanks. Pushed some more in c168832
Add caching for partition data extraction methods that were still being looked up per-field:
- PartitionSpec.partitionType()
- StructType.fields()
- NestedField.type(), fieldId(), name(), isOptional()
- StructLike.get(int, Class<?>)

These methods are called for every partition field in every task, so caching them provides significant speedup.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
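As a rough illustration of what caching those per-field lookups might look like (hypothetical names, not the Comet implementation), the Method handles can be resolved once and then invoked for each partition field of each task:

```scala
import java.lang.reflect.Method

// Hypothetical sketch, not the actual Comet code: resolve the per-field accessors
// once and reuse the Method handles for every partition field of every task.
object PartitionFieldReflectionSketch {
  case class PartitionFieldReflection(
      partitionTypeMethod: Method, // PartitionSpec.partitionType()
      fieldsMethod: Method,        // Types.StructType.fields()
      fieldIdMethod: Method,       // Types.NestedField.fieldId()
      nameMethod: Method,          // Types.NestedField.name()
      isOptionalMethod: Method,    // Types.NestedField.isOptional()
      getMethod: Method)           // StructLike.get(int, Class<?>)

  def create(): PartitionFieldReflection = {
    val specClass = Class.forName("org.apache.iceberg.PartitionSpec")
    val structTypeClass = Class.forName("org.apache.iceberg.types.Types$StructType")
    val nestedFieldClass = Class.forName("org.apache.iceberg.types.Types$NestedField")
    val structLikeClass = Class.forName("org.apache.iceberg.StructLike")
    PartitionFieldReflection(
      partitionTypeMethod = specClass.getMethod("partitionType"),
      fieldsMethod = structTypeClass.getMethod("fields"),
      fieldIdMethod = nestedFieldClass.getMethod("fieldId"),
      nameMethod = nestedFieldClass.getMethod("name"),
      isOptionalMethod = nestedFieldClass.getMethod("isOptional"),
      getMethod = structLikeClass.getMethod("get", classOf[Int], classOf[Class[_]]))
  }
}
```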
mbutrovich left a comment
This is a good PR. Even if it only accelerates things on the driver, it gives an opportunity to revisit the reflection code. In general, I am a little concerned about some semantics I think we should enforce. Namely, if reflection fails, we should fall back to Spark. I think there are code paths where reflection can fail, but we'll proceed with serializing a scan with possibly missing fields. This is not necessarily all in the scope of the changes in this PR, but maybe we can include them/revisit here, or we should do a quick followup. What do you think @andygrove? If it isn't urgent to merge this, it might be good to do it here.
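A minimal sketch of the fallback semantics suggested here (illustrative names, not the actual Comet API): the conversion path surfaces reflection failures to its caller instead of continuing with missing fields, so the original Spark scan stays in place.

```scala
import org.slf4j.LoggerFactory

// Hypothetical sketch of the suggested semantics, not the actual Comet code:
// reflection failures surface at the conversion boundary so the caller keeps
// the original Spark scan instead of a partially-serialized Comet scan.
object IcebergScanFallbackSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Stand-in for the real serde path; the point is that it throws on any failed
  // reflective lookup or invocation rather than silently dropping fields.
  private def serializeWithReflection(scan: AnyRef): Array[Byte] =
    throw new ReflectiveOperationException("example: method not found")

  def tryConvert(scan: AnyRef): Option[Array[Byte]] =
    try {
      Some(serializeWithReflection(scan))
    } catch {
      case e: ReflectiveOperationException =>
        log.warn(s"Iceberg reflection failed, falling back to Spark: ${e.getMessage}")
        None // caller leaves the Spark scan in place
    }
}
```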
val inputPartClass = inputPartition.getClass

try {
Does it make sense to try to encapsulate these remaining reflection calls in the cache as well?
} catch {
  case e: Exception =>
    logWarning(s"Failed to serialize delete file: ${e.getMessage}")
    None
I am concerned about some of the nested try-catch logic here. If we fail to serialize delete files, we should propagate an exception to make sure we fall back to Spark. IIUC, this code will return None for delete files, which will result in a serialized scan that does not apply delete files and could generate invalid data.
Most of the functions in IcebergReflection might be okay, but buildFieldIdMapping and extractDeleteFilesList should probably throw at a minimum.
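For example, one way to make the delete-file extraction fail loudly (a sketch assuming the method reads the task's deletes() list through a cached Method handle; not the actual Comet code) is to rethrow instead of returning None:

```scala
import java.lang.reflect.Method

// Hypothetical sketch, not the Comet code: propagate the failure so the planner
// can fall back to Spark, rather than returning None and producing a scan that
// silently skips delete files.
object DeleteFileExtractionSketch {
  def extractDeleteFilesList(task: AnyRef, deletesMethod: Method): java.util.List[AnyRef] =
    try {
      // deletesMethod is assumed to be a cached handle for FileScanTask.deletes()
      deletesMethod.invoke(task).asInstanceOf[java.util.List[AnyRef]]
    } catch {
      case e: Exception =>
        throw new IllegalStateException(
          s"Failed to extract Iceberg delete files: ${e.getMessage}", e)
    }
}
```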
Summary
- Cache Class.forName() and getMethod() reflection calls in a ReflectionCache object
- Cache buildFieldIdMapping() results by schema identity

Benchmark Results (30,000 tasks)
- Original: 34,425 ms per 100 iterations
- After caching: 16,618 ms per 100 iterations
- Improvement: 52% faster
Key Optimizations
- Resolve reflection handles once per convert() call instead of per-task
- Only call PartitionSpecParser.toJson() for new unique spec objects
- Cache buildFieldIdMapping() results by schema object identity
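As a rough sketch of the spec deduplication listed above (hypothetical names, not the Comet code), object identity can key a small cache so PartitionSpecParser.toJson() runs once per distinct spec instance rather than once per task:

```scala
import java.lang.reflect.Method
import java.util.IdentityHashMap

// Hypothetical sketch, not the Comet code: identical spec objects map to the same
// JSON, so the static PartitionSpecParser.toJson(PartitionSpec) is invoked only
// for spec instances we have not seen before.
object SpecJsonDedupSketch {
  private val specJsonCache = new IdentityHashMap[AnyRef, String]()

  def specToJson(spec: AnyRef, toJsonMethod: Method): String = {
    val cached = specJsonCache.get(spec)
    if (cached != null) {
      cached
    } else {
      val json = toJsonMethod.invoke(null, spec).asInstanceOf[String] // static method, null receiver
      specJsonCache.put(spec, json)
      json
    }
  }
}
```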
The ReflectionCache holds:
- Classes: ContentScanTask, FileScanTask, ContentFile, DeleteFile, SchemaParser, Schema, PartitionSpecParser, PartitionSpec
- Methods: file(), start(), length(), partition(), residual(), schema(), deletes(), spec(), location(), content(), specId(), equalityFieldIds(), toJson()

Test plan