
Conversation

@andygrove andygrove commented Jan 27, 2026

Summary

  • Cache Class.forName() and getMethod() reflection calls in a ReflectionCache object
  • Use object identity for partition spec/type deduplication (only call toJson for new unique specs)
  • Cache buildFieldIdMapping() results by schema identity

Benchmark Results (30,000 tasks)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| FileScanTask -> Protobuf (convert) | 34,425 ms | 16,618 ms | 52% faster |

Key Optimizations

  1. ReflectionCache: Cache all Iceberg classes and methods once per convert() call instead of per-task
  2. Partition spec identity dedup: Only call PartitionSpecParser.toJson() for new unique spec objects
  3. Partition type identity dedup: Same spec = same partition type, skip JSON building for duplicates
  4. Field ID mapping cache: Cache buildFieldIdMapping() results by schema object identity
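The identity-based deduplication in items 2–4 can be sketched as follows. This is a minimal illustration of the idea, not the PR's code; `SpecJsonCache` and its `toJson` parameter are hypothetical names. A `java.util.IdentityHashMap` keys on object reference (`eq`), so tasks that share the same spec object reuse one serialized result and the expensive conversion runs once per unique object:

```scala
import java.util.IdentityHashMap

// Hypothetical sketch of identity-keyed dedup: IdentityHashMap compares keys
// by reference, matching the "object identity" dedup described above.
final class SpecJsonCache[S](toJson: S => String) {
  private val cache = new IdentityHashMap[S, String]()

  def jsonFor(spec: S): String = {
    val cached = cache.get(spec)
    if (cached != null) cached
    else {
      // First time we see this exact object: serialize and remember it.
      val json = toJson(spec)
      cache.put(spec, json)
      json
    }
  }
}
```

With 30,000 tasks sharing a handful of spec objects, the serializer callback fires only once per distinct object rather than once per task.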

The ReflectionCache holds:

  • Iceberg classes: ContentScanTask, FileScanTask, ContentFile, DeleteFile, SchemaParser, Schema, PartitionSpecParser, PartitionSpec
  • Methods: file(), start(), length(), partition(), residual(), schema(), deletes(), spec(), location(), content(), specId(), equalityFieldIds(), toJson()
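The caching idea can be sketched like this. The shape is illustrative, not the PR's exact `ReflectionCache` (`classFor`/`methodFor` are hypothetical): memoize `Class.forName` and `getMethod` so each lookup resolves once per `convert()` call instead of once per task:

```scala
import java.lang.reflect.Method
import scala.collection.mutable

// Illustrative sketch: resolve classes and no-arg methods once, then reuse
// the java.lang.reflect.Method handles for every task. (The real cache also
// stores methods with parameters, such as StructLike.get(int, Class[_]).)
final class ReflectionCache(loader: ClassLoader) {
  private val classes = mutable.HashMap.empty[String, Class[_]]
  private val methods = mutable.HashMap.empty[(String, String), Method]

  def classFor(name: String): Class[_] =
    classes.getOrElseUpdate(name, Class.forName(name, true, loader))

  def methodFor(className: String, methodName: String): Method =
    methods.getOrElseUpdate(
      (className, methodName),
      classFor(className).getMethod(methodName))
}
```

Repeated calls with the same key return the same cached `Method` instance, so the per-task cost drops to a hash lookup plus `Method.invoke`.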

Test plan

  • Existing Iceberg tests pass
  • Benchmarked with 30,000 partitions showing 52% improvement


Cache reflection and computation results to avoid redundant work:

1. ReflectionCache: Cache Class.forName() and getMethod() calls once per
   convert() instead of per-task (30,000+ times)

2. Partition spec deduplication by object identity: Only call toJson() for
   new unique specs, not for every task

3. Partition type deduplication by spec identity: Same spec = same partition
   type, so skip JSON building for duplicate specs

4. Field ID mapping cache: Cache buildFieldIdMapping() results by schema
   identity to avoid repeated reflection per-column

Benchmark results (30,000 tasks):
- Original:        34,425 ms per 100 iterations
- After caching:   16,618 ms per 100 iterations
- Improvement:     52% faster

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@andygrove andygrove force-pushed the iceberg-reflection-caching branch from 61d79ec to bf83e76 Compare January 27, 2026 15:25
@andygrove andygrove changed the title perf: Cache reflection lookups in Iceberg serde for 24% faster serialization perf: Cache reflection lookups in Iceberg serde for ~50% faster serialization Jan 27, 2026
@andygrove andygrove changed the title perf: Cache reflection lookups in Iceberg serde for ~50% faster serialization perf: Iceberg serde ~50% faster serialization Jan 27, 2026
@mbutrovich
Contributor

Don't we have an IcebergReflection helper? It seems like we should try to encapsulate this logic there.

@andygrove andygrove marked this pull request as draft January 27, 2026 15:45
codecov-commenter commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 88.58696% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.09%. Comparing base (f09f8af) to head (c168832).
⚠️ Report is 901 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../comet/serde/operator/CometIcebergNativeScan.scala | 76.19% | 11 Missing and 9 partials ⚠️ |
| ...a/org/apache/comet/iceberg/IcebergReflection.scala | 99.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3298      +/-   ##
============================================
+ Coverage     56.12%   60.09%   +3.96%     
- Complexity      976     1473     +497     
============================================
  Files           119      175      +56     
  Lines         11743    16246    +4503     
  Branches       2251     2688     +437     
============================================
+ Hits           6591     9763    +3172     
- Misses         4012     5131    +1119     
- Partials       1140     1352     +212     


Move the ReflectionCache case class and createReflectionCache() method
from CometIcebergNativeScan to the IcebergReflection helper class per
code review feedback. This encapsulates all Iceberg reflection caching
logic in the shared reflection utilities.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@andygrove
Member Author

> Don't we have an IcebergReflection helper? It seems like we should try to encapsulate this logic there.

Thanks. I pushed another commit to do this.

@andygrove andygrove changed the title perf: Iceberg serde ~50% faster serialization perf: Iceberg serde ~50% faster serialization [iceberg] Jan 27, 2026
@andygrove andygrove marked this pull request as ready for review January 27, 2026 18:20
@mbutrovich mbutrovich self-requested a review January 27, 2026 19:06
```scala
if (fieldTypeStr == IcebergReflection.TypeNames.UNKNOWN) {
  None
} else {
  val fieldIdMethod = field.getClass.getMethod("fieldId")
```
Contributor:

It looks like some of these still aren't being cached?

Member Author:

Thanks. Pushed some more in c168832

Add caching for partition data extraction methods that were still
being looked up per-field:
- PartitionSpec.partitionType()
- StructType.fields()
- NestedField.type(), fieldId(), name(), isOptional()
- StructLike.get(int, Class<?>)

These methods are called for every partition field in every task,
so caching them provides significant speedup.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
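The win in this commit comes from hoisting reflective lookups out of per-field loops. A toy before/after sketch (hypothetical helper names, using `java.lang.String` in place of Iceberg's `NestedField` so it is self-contained):

```scala
import java.lang.reflect.Method

// Before: getMethod runs on every iteration — the pattern this commit removes.
def sumLengthsUncached(fields: Seq[AnyRef]): Int =
  fields.map(f => f.getClass.getMethod("length").invoke(f).asInstanceOf[Int]).sum

// After: the Method handle is resolved once and reused for every field.
def sumLengthsCached(fields: Seq[String]): Int = {
  val lengthMethod: Method = classOf[String].getMethod("length")
  fields.map(f => lengthMethod.invoke(f).asInstanceOf[Int]).sum
}
```

Both return the same result; only the number of `getMethod` resolutions differs, which matters when the loop body runs once per partition field per task.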
@mbutrovich mbutrovich self-requested a review January 28, 2026 14:35

@mbutrovich mbutrovich left a comment


This is a good PR. Even if it only accelerates things on the driver, it gives an opportunity to revisit the reflection code. In general, I am a little concerned about some semantics I think we should enforce. Namely, if reflection fails, we should fall back to Spark. I think there are code paths where reflection can fail, but we'll proceed with serializing a scan with possibly missing fields. This is not necessarily all in the scope of the changes in this PR, but maybe we can include them/revisit here, or we should do a quick followup. What do you think @andygrove? If it isn't urgent to merge this, it might be good to do it here.

Comment on lines 721 to 723
```scala
val inputPartClass = inputPartition.getClass

try {
```
Contributor:

Does it make sense to try to encapsulate these remaining reflection calls in the cache as well?

```scala
} catch {
  case e: Exception =>
    logWarning(s"Failed to serialize delete file: ${e.getMessage}")
    None
```
Contributor:

I am concerned about some of the nested try-catch logic here. If we fail to serialize delete files, we should propagate an exception to make sure we fall back to Spark. IIUC, this code will return None for delete files, which will result in a serialized scan that does not apply delete files and potentially generates invalid data.
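The suggested fail-fast semantics could look like the sketch below. The shapes here are assumptions: `extractDeleteFilesList` is a name this review mentions, but its signature and the `IcebergSerdeException` type are hypothetical. The point is to propagate a typed exception instead of swallowing the failure, so the caller can fall back to Spark rather than run a scan with missing delete files:

```scala
// Hypothetical exception type for serde failures that must trigger fallback.
final case class IcebergSerdeException(msg: String, cause: Throwable)
    extends RuntimeException(msg, cause)

// Assumed signature: the real method reflects over the task; here the
// extraction step is injected so the sketch stays self-contained.
def extractDeleteFilesList[T](task: AnyRef, extract: AnyRef => Seq[T]): Seq[T] =
  try extract(task)
  catch {
    case e: Exception =>
      // Propagate instead of returning None: a scan that silently drops
      // delete files could return rows that should have been deleted.
      throw IcebergSerdeException(s"Failed to serialize delete files: ${e.getMessage}", e)
  }
```

A caller at the plan-conversion boundary would catch `IcebergSerdeException` and decline native execution for that scan.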

Contributor:

Most of the functions in IcebergReflection might be okay, but buildFieldIdMapping and extractDeleteFilesList should probably throw at a minimum.
