
Commit 6409885

DataSourceV2ScanExecBase and "v2 Bucketing"
1 parent 9841744 commit 6409885

3 files changed: +154, -43 lines changed

docs/configuration-properties.md

Lines changed: 41 additions & 0 deletions
@@ -1262,6 +1262,47 @@ Use [SQLConf.OUTPUT_COMMITTER_CLASS](SQLConf.md#OUTPUT_COMMITTER_CLASS) to acces
!!! note
    `ParquetUtils` uses [spark.sql.parquet.output.committer.class](#spark.sql.parquet.output.committer.class) or the default `ParquetOutputCommitter` instead.

+### <span id="V2_BUCKETING_ENABLED"> v2.bucketing.enabled { #spark.sql.sources.v2.bucketing.enabled }
+
+**spark.sql.sources.v2.bucketing.enabled**
+
+Enables bucketing for [connectors](connector/index.md) (_V2 data sources_).
+
+When enabled, Spark recognizes the specific distribution reported by a V2 data source through [SupportsReportPartitioning](connector/SupportsReportPartitioning.md) and avoids a shuffle when possible.
+
+Similar to [spark.sql.sources.bucketing.enabled](#spark.sql.sources.bucketing.enabled)
+
+Use [SQLConf.v2BucketingEnabled](SQLConf.md#v2BucketingEnabled) for the current value
+
+Used when:
+
+* `DataSourceV2ScanExecBase` is requested to [groupPartitions](physical-operators/DataSourceV2ScanExecBase.md#groupPartitions)
+
+### <span id="V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED"> v2.bucketing.partiallyClusteredDistribution.enabled { #spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled }
+
+**spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled**
+
+During a Storage-Partitioned Join, whether to allow input partitions to be partially clustered when both sides of the join are of `KeyGroupedPartitioning`.
+
+Default: `false`
+
+At planning time, Spark picks the side with less data (based on table statistics) and groups and replicates its partitions to match the other side.
+
+This is an optimization for skewed joins and can help reduce data skew when certain partitions are assigned a large amount of data.
+
+Requires both [spark.sql.sources.v2.bucketing.enabled](#spark.sql.sources.v2.bucketing.enabled) and [spark.sql.sources.v2.bucketing.pushPartValues.enabled](#spark.sql.sources.v2.bucketing.pushPartValues.enabled) to be enabled
+
+### <span id="V2_BUCKETING_PUSH_PART_VALUES_ENABLED"> v2.bucketing.pushPartValues.enabled { #spark.sql.sources.v2.bucketing.pushPartValues.enabled }
+
+**spark.sql.sources.v2.bucketing.pushPartValues.enabled**
+
+Whether to push down common partition values when [spark.sql.sources.v2.bucketing.enabled](#spark.sql.sources.v2.bucketing.enabled) is enabled.
+
+Default: `false`
+
+When enabled, if both sides of a join are of `KeyGroupedPartitioning` and share compatible partition keys (even if their partition values are not identical), Spark calculates a superset of the partition values and pushes that information down to the scan nodes, which use empty partitions for the missing partition values on either side.
+This could help eliminate unnecessary shuffles.
+
## <span id="spark.sql.objectHashAggregate.sortBased.fallbackThreshold"> spark.sql.objectHashAggregate.sortBased.fallbackThreshold

**(internal)** The number of entries in an in-memory hash map (to store aggregation buffers per grouping keys) before [ObjectHashAggregateExec](physical-operators/ObjectHashAggregateExec.md) ([ObjectAggregationIterator](aggregations/ObjectAggregationIterator.md#processInputs), precisely) falls back to sort-based aggregation
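
As a quick, illustrative sketch (not part of the diff above), the three properties could be enabled on a `SparkSession` before joining two V2 tables whose scans report `KeyGroupedPartitioning`. The catalog and table names below are hypothetical, and whether the shuffle is actually avoided depends on the connector.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session with the v2 bucketing properties enabled
// (the latter two only take effect with v2.bucketing.enabled).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("v2-bucketing-demo")
  .config("spark.sql.sources.v2.bucketing.enabled", true)
  .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", true)
  .config("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", true)
  .getOrCreate()

// demo.db.orders and demo.db.customers are hypothetical V2 tables whose scans
// report KeyGroupedPartitioning over customer_id. With the properties above,
// the join below may be planned without a shuffle on either side.
val orders = spark.table("demo.db.orders")
val customers = spark.table("demo.db.customers")
orders.join(customers, Seq("customer_id")).explain()
```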

docs/connector/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Connector API

-**Connector API** is a new API in Spark 3 for Spark SQL developers to create [connectors](../connectors/index.md) (_data sources_ or _providers_).
+**Connector API** is a new API in Spark 3 for Spark SQL developers to create [connectors](../connectors/index.md) (_V2 data sources_ or _providers_).

!!! note
    Connector API is meant to replace the older (deprecated) DataSource v1 and v2.

docs/physical-operators/DataSourceV2ScanExecBase.md

Lines changed: 112 additions & 42 deletions
@@ -32,18 +32,21 @@ See:

* [BatchScanExec](BatchScanExec.md#inputRDD)

-### keyGroupedPartitioning { #keyGroupedPartitioning }
+### Custom Partitioning Expressions { #keyGroupedPartitioning }

```scala
keyGroupedPartitioning: Option[Seq[Expression]]
```

-Optional partitioning [expression](../expressions/Expression.md)s
+Optional partitioning [expression](../expressions/Expression.md)s (provided by [connectors](../connector/index.md) using [SupportsReportPartitioning](../connector/SupportsReportPartitioning.md))

See:

* [BatchScanExec](BatchScanExec.md#keyGroupedPartitioning)

+??? note "Spark Structured Streaming Not Supported"
+    `ContinuousScanExec` and `MicroBatchScanExec` physical operators are not supported (and have `keyGroupedPartitioning` undefined (`None`)).
+
Used when:

* `DataSourceV2ScanExecBase` is requested to [groupedPartitions](#groupedPartitions), [groupPartitions](#groupPartitions), [outputPartitioning](#outputPartitioning)
8083
* `ContinuousScanExec` ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/ContinuousScanExec))
8184
* `MicroBatchScanExec` ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/MicroBatchScanExec))
8285

83-
## <span id="doExecute"> Executing Physical Operator
86+
## Executing Physical Operator { #doExecute }
8487

85-
```scala
86-
doExecute(): RDD[InternalRow]
87-
```
88+
??? note "SparkPlan"
8889

89-
`doExecute` is part of the [SparkPlan](SparkPlan.md#doExecute) abstraction.
90+
```scala
91+
doExecute(): RDD[InternalRow]
92+
```
9093

91-
---
94+
`doExecute` is part of the [SparkPlan](SparkPlan.md#doExecute) abstraction.
9295

9396
`doExecute`...FIXME
9497

95-
## <span id="doExecuteColumnar"> doExecuteColumnar
98+
## doExecuteColumnar { #doExecuteColumnar }
9699

97-
```scala
98-
doExecuteColumnar(): RDD[ColumnarBatch]
99-
```
100+
??? note "SparkPlan"
100101

101-
`doExecuteColumnar` is part of the [SparkPlan](SparkPlan.md#doExecuteColumnar) abstraction.
102+
```scala
103+
doExecuteColumnar(): RDD[ColumnarBatch]
104+
```
102105

103-
---
106+
`doExecuteColumnar` is part of the [SparkPlan](SparkPlan.md#doExecuteColumnar) abstraction.
104107

105108
`doExecuteColumnar`...FIXME
106109

107-
## <span id="metrics"> Performance Metrics
110+
## Performance Metrics { #metrics }
108111

109-
```scala
110-
metrics: Map[String, SQLMetric]
111-
```
112+
??? note "SparkPlan"
112113

113-
`metrics` is part of the [SparkPlan](SparkPlan.md#metrics) abstraction.
114+
```scala
115+
metrics: Map[String, SQLMetric]
116+
```
114117

115-
---
118+
`metrics` is part of the [SparkPlan](SparkPlan.md#metrics) abstraction.
116119

117120
`metrics` is the following [SQLMetric](../SQLMetric.md)s with the [customMetrics](#customMetrics):
118121

119122
Metric Name | web UI
120123
------------|--------
121124
`numOutputRows` | number of output rows
122125

123-
## <span id="outputPartitioning"> Output Data Partitioning Requirements
126+
## Output Data Partitioning { #outputPartitioning }
124127

125-
```scala
126-
outputPartitioning: physical.Partitioning
127-
```
128+
??? note "SparkPlan"
128129

129-
`outputPartitioning` is part of the [SparkPlan](SparkPlan.md#outputPartitioning) abstraction.
130+
```scala
131+
outputPartitioning: physical.Partitioning
132+
```
130133

131-
---
134+
`outputPartitioning` is part of the [SparkPlan](SparkPlan.md#outputPartitioning) abstraction.
132135

133136
`outputPartitioning`...FIXME
134137

135-
## <span id="simpleString"> Simple Node Description
138+
## Output Data Ordering { #outputOrdering }
136139

137-
```scala
138-
simpleString(
139-
maxFields: Int): String
140-
```
140+
??? note "QueryPlan"
141141

142-
`simpleString` is part of the [TreeNode](../catalyst/TreeNode.md#simpleString) abstraction.
142+
```scala
143+
outputOrdering: Seq[SortOrder]
144+
```
143145

144-
---
146+
`outputOrdering` is part of the [QueryPlan](../catalyst/QueryPlan.md#outputOrdering) abstraction.
147+
148+
`outputOrdering`...FIXME
149+
150+
## Simple Node Description { #simpleString }
151+
152+
??? note "TreeNode"
153+
154+
```scala
155+
simpleString(
156+
maxFields: Int): String
157+
```
158+
159+
`simpleString` is part of the [TreeNode](../catalyst/TreeNode.md#simpleString) abstraction.
145160

146161
`simpleString`...FIXME
147162

148-
## <span id="supportsColumnar"> supportsColumnar
163+
## supportsColumnar { #supportsColumnar }
149164

150-
```scala
151-
supportsColumnar: Boolean
152-
```
165+
??? note "SparkPlan"
153166

154-
`supportsColumnar` is part of the [SparkPlan](SparkPlan.md#supportsColumnar) abstraction.
167+
```scala
168+
supportsColumnar: Boolean
169+
```
155170

156-
---
171+
`supportsColumnar` is part of the [SparkPlan](SparkPlan.md#supportsColumnar) abstraction.
157172

158173
`supportsColumnar` is `true` if the [PartitionReaderFactory](#readerFactory) can [supportColumnarReads](../connector/PartitionReaderFactory.md#supportColumnarReads) for all the [input partitions](#inputPartitions). Otherwise, `supportsColumnar` is `false`.
159174
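
The all-or-nothing rule can be summarized with a small conceptual sketch (a simplification, not the exact Spark source); `readerFactory` and `inputPartitions` stand for the operator's readerFactory and inputPartitions.

```scala
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}

// Conceptual sketch: every input partition must support columnar reads for
// supportsColumnar to be true; mixing row-based and columnar partitions is
// rejected (see the "Cannot mix..." requirement below).
def supportsColumnar(
    readerFactory: PartitionReaderFactory,
    inputPartitions: Seq[InputPartition]): Boolean = {
  val allColumnar = inputPartitions.forall(p => readerFactory.supportColumnarReads(p))
  val anyColumnar = inputPartitions.exists(p => readerFactory.supportColumnarReads(p))
  require(allColumnar || !anyColumnar,
    "Cannot mix row-based and columnar input partitions.")
  allColumnar
}
```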

@@ -165,7 +180,7 @@ supportsColumnar: Boolean
Cannot mix row-based and columnar input partitions.
```

-## <span id="customMetrics"> Custom Metrics
+## Custom Metrics { #customMetrics }

```scala
customMetrics: Map[String, SQLMetric]
@@ -187,9 +202,9 @@ customMetrics: Map[String, SQLMetric]
* `ContinuousScanExec` is requested for the `inputRDD`
* `MicroBatchScanExec` is requested for the `inputRDD` (that creates a [DataSourceRDD](../DataSourceRDD.md))

-## <span id="verboseStringWithOperatorId"> verboseStringWithOperatorId
+## verboseStringWithOperatorId { #verboseStringWithOperatorId }

-??? note "Signature"
+??? note "QueryPlan"

    ```scala
    verboseStringWithOperatorId(): String
@@ -209,3 +224,58 @@ In the end, `verboseStringWithOperatorId` is as follows (based on [formattedNode
Output: [output]
[metaDataStr]
```
+
+## Input Partitions { #partitions }
+
+```scala
+partitions: Seq[Seq[InputPartition]]
+```
+
+`partitions`...FIXME
+
+---
+
+`partitions` is used when:
+
+* `BatchScanExec` physical operator is requested to [filteredPartitions](BatchScanExec.md#filteredPartitions)
+* `ContinuousScanExec` physical operator ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/ContinuousScanExec)) is requested for the `inputRDD`
+* `MicroBatchScanExec` physical operator ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/MicroBatchScanExec)) is requested for the `inputRDD`
+
+## groupedPartitions { #groupedPartitions }
+
+```scala
+groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]]
+```
+
+??? note "Lazy Value"
+    `groupedPartitions` is a Scala **lazy value** to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
+
+    Learn more in the [Scala Language Specification]({{ scala.spec }}/05-classes-and-objects.html#lazy).
+
+`groupedPartitions` takes the [keyGroupedPartitioning](#keyGroupedPartitioning), if specified, and [groups](#groupPartitions) the [input partitions](#inputPartitions).
+
+---
+
+`groupedPartitions` is used when:
+
+* `DataSourceV2ScanExecBase` physical operator is requested for the [output data ordering](#outputOrdering), [output data partitioning requirements](#outputPartitioning), [partitions](#partitions)
+
+## groupPartitions { #groupPartitions }
+
+```scala
+groupPartitions(
+  inputPartitions: Seq[InputPartition],
+  groupSplits: Boolean = !conf.v2BucketingPushPartValuesEnabled || !conf.v2BucketingPartiallyClusteredDistributionEnabled): Option[Seq[(InternalRow, Seq[InputPartition])]]
+```
+
+!!! note "Noop"
+    `groupPartitions` does nothing (and returns `None`) when called with [spark.sql.sources.v2.bucketing.enabled](../configuration-properties.md#spark.sql.sources.v2.bucketing.enabled) disabled.
+
+`groupPartitions`...FIXME
+
+---
+
+`groupPartitions` is used when:
+
+* `BatchScanExec` physical operator is requested for the [filtered input partitions](BatchScanExec.md#filteredPartitions) and [input RDD](BatchScanExec.md#inputRDD)
+* `DataSourceV2ScanExecBase` is requested for the [groupedPartitions](#groupedPartitions)
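
For context (not part of this diff), a simplified sketch of what grouping input partitions means here: splits that implement `HasPartitionKey` are grouped by the partition key they report. The helper name is hypothetical, the sketch assumes the partition-key rows compare by value, and the real `groupPartitions` additionally honors the v2 bucketing properties, ordering, and partially clustered distribution.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{HasPartitionKey, InputPartition}

// Hypothetical helper: group splits by their reported partition key.
// Returns None when any split does not expose a partition key.
def groupByPartitionKey(
    inputPartitions: Seq[InputPartition]): Option[Seq[(InternalRow, Seq[InputPartition])]] = {
  val keyed = inputPartitions.map {
    case p: HasPartitionKey => Some((p.partitionKey(), p: InputPartition))
    case _ => None
  }
  if (keyed.contains(None)) None
  else Some(
    keyed.flatten
      .groupBy { case (key, _) => key }
      .map { case (key, splits) => key -> splits.map(_._2) }
      .toSeq)
}
```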
