You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/SQLConf.md
+8Lines changed: 8 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1053,6 +1053,14 @@ Used when `CacheManager` is requested to [cache a structured query](CacheManager
1053
1053
1054
1054
Used when [Aggregation](execution-planning-strategies/Aggregation.md) execution planning strategy is executed (and uses `AggUtils` to [create an aggregation physical operator](aggregations/AggUtils.md#createAggregate)).
During a Storage-Partitioned Join, whether to allow input partitions to be partially clustered, when both sides of the join are of `KeyGroupedPartitioning`.
1285
+
During a [Storage-Partitioned Join](storage-partitioned-joins/index.md), whether to allow input partitions to be partially clustered, when both sides of the join are of [KeyGroupedPartitioning](connector/KeyGroupedPartitioning.md).
1286
1286
1287
1287
Default: `false`
1288
1288
@@ -1292,6 +1292,14 @@ This is an optimization on skew join and can help to reduce data skewness when c
1292
1292
1293
1293
Requires both [spark.sql.sources.v2.bucketing.enabled](#spark.sql.sources.v2.bucketing.enabled) and [spark.sql.sources.v2.bucketing.pushPartValues.enabled](#spark.sql.sources.v2.bucketing.pushPartValues.enabled) to be enabled
1294
1294
1295
+
Use [SQLConf.v2BucketingPartiallyClusteredDistributionEnabled](SQLConf.md#v2BucketingPartiallyClusteredDistributionEnabled) for the current value
1296
+
1297
+
Used when:
1298
+
1299
+
*`BatchScanExec` physical operator is requested for the [input RDD](physical-operators/BatchScanExec.md#inputRDD)
1300
+
*`DataSourceV2ScanExecBase` physical operator is requested for [groupPartitions](physical-operators/DataSourceV2ScanExecBase.md#groupPartitions)
1301
+
*[EnsureRequirements](physical-optimizations/EnsureRequirements.md) physical optimization is executed (to [checkKeyGroupCompatible](physical-optimizations/EnsureRequirements.md#checkKeyGroupCompatible))
When enabled, if both sides of a join are of `KeyGroupedPartitioning` and if they share compatible partition keys, even if they don't have the exact same partition values, Spark will calculate a superset of partition values and pushdown that info to scan nodes, which will use empty partitions for the missing partition values on either side.
1304
1312
This could help to eliminate unnecessary shuffles.
1305
1313
1314
+
Use [SQLConf.v2BucketingPushPartValuesEnabled](SQLConf.md#v2BucketingPushPartValuesEnabled) for the current value
1315
+
1316
+
Used when:
1317
+
1318
+
*`DataSourceV2ScanExecBase` physical operator is requested to [groupPartitions](physical-operators/DataSourceV2ScanExecBase.md#groupPartitions)
1319
+
*`BatchScanExec` physical operator is requested for the [inputRDD](physical-operators/BatchScanExec.md#inputRDD)
1320
+
*`EnsureRequirements` physical optimization is requested to [checkKeyGroupCompatible](physical-optimizations/EnsureRequirements.md#checkKeyGroupCompatible)
**(internal)** The number of entires in an in-memory hash map (to store aggregation buffers per grouping keys) before [ObjectHashAggregateExec](physical-operators/ObjectHashAggregateExec.md) ([ObjectAggregationIterator](aggregations/ObjectAggregationIterator.md#processInputs), precisely) falls back to sort-based aggregation
Copy file name to clipboardExpand all lines: docs/physical-operators/SortMergeJoinExec.md
+4Lines changed: 4 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,3 +1,7 @@
1
+
---
2
+
title: SortMergeJoinExec
3
+
---
4
+
1
5
# SortMergeJoinExec Physical Operator
2
6
3
7
`SortMergeJoinExec` is a [shuffle-based join physical operator](ShuffledJoin.md) for [sort-merge join](#doExecute) (with the [left join keys](#leftKeys) being [orderable](../expressions/RowOrdering.md#isorderable)).
1. Uses `JoinType` of either [SortMergeJoinExec](../physical-operators/SortMergeJoinExec.md) or [ShuffledHashJoinExec](../physical-operators/ShuffledHashJoinExec.md) physical operator
102
+
103
+
!!! note
104
+
Only [SortMergeJoinExec](../physical-operators/SortMergeJoinExec.md) and [ShuffledHashJoinExec](../physical-operators/ShuffledHashJoinExec.md) physical operators are considered.
105
+
106
+
`checkKeyGroupCompatible`...FIXME
107
+
86
108
## OptimizeSkewedJoin { #OptimizeSkewedJoin }
87
109
88
110
`EnsureRequirements` is used to create a [OptimizeSkewedJoin](OptimizeSkewedJoin.md) physical optimization.
**Storage-Partitioned Joins** (_SPJ_) are a new type of [join](../joins.md) in Spark SQL that use the existing storage layout for a partitioned join to avoid expensive shuffles (similarly to [Bucketing](../bucketing/index.md)).
3
+
**Storage-Partitioned Join** (_SPJ_) is a new type of [join](../joins.md) in Spark SQL that uses the existing storage layout for a partitioned join to avoid expensive shuffles (similarly to [Bucketing](../bucketing/index.md)).
4
4
5
5
!!! note
6
6
Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([\[SPARK-37375\] Umbrella: Storage Partitioned Join (SPJ)]({{ spark.jira }}/SPARK-37375)).
7
7
8
-
Storage-Partitioned Join is meant mainly, if not exclusively, for [Spark SQL connectors](../connector/index.md) (_v2 data sources_).
8
+
Storage-Partitioned Join is based on [KeyGroupedPartitioning](../connector/KeyGroupedPartitioning.md) to determine partitions.
9
+
10
+
Out of the available built-in [DataSourceV2ScanExecBase](../physical-operators/DataSourceV2ScanExecBase.md) physical operators, only [BatchScanExec](../physical-operators/BatchScanExec.md) supports storage-partitioned joins.
11
+
12
+
Storage-Partitioned Join is meant for [Spark SQL connectors](../connector/index.md) (yet there are none built-in at the moment).
9
13
10
14
Storage-Partitioned Join was proposed in this [SPIP](https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE).
11
15
12
-
Storage-Partitioned Join uses [KeyGroupedPartitioning](../connector/KeyGroupedPartitioning.md) to determine partitions.
16
+
!!! note
17
+
It [appears](../physical-optimizations/EnsureRequirements.md#checkKeyGroupCompatible) that [SortMergeJoinExec](../physical-operators/SortMergeJoinExec.md) and [ShuffledHashJoinExec](../physical-operators/ShuffledHashJoinExec.md) physical operator are the only candidates for Storage-Partitioned Joins.
0 commit comments