Use [SQLConf.OUTPUT_COMMITTER_CLASS](SQLConf.md#OUTPUT_COMMITTER_CLASS) to access the current value.

!!! note
    `ParquetUtils` uses [spark.sql.parquet.output.committer.class](#spark.sql.parquet.output.committer.class) or the default `ParquetOutputCommitter` instead.

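For example (a hedged sketch; the committer class below is hypothetical):

```scala
// The property takes a fully-qualified class name of an output committer.
// `org.example.MyParquetOutputCommitter` is made up for illustration.
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "org.example.MyParquetOutputCommitter")
```
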
## spark.sql.sources.v2.bucketing.enabled { #spark.sql.sources.v2.bucketing.enabled }

Enables bucketing for [connectors](connector/index.md) (_V2 data sources_).

Default: `false`

When enabled, Spark will recognize the specific distribution reported by a V2 data source through [SupportsReportPartitioning](connector/SupportsReportPartitioning.md), and avoid a shuffle when possible.

Similar to [spark.sql.sources.bucketing.enabled](#spark.sql.sources.bucketing.enabled)

Use [SQLConf.v2BucketingEnabled](SQLConf.md#v2BucketingEnabled) for the current value

Used when:

* `DataSourceV2ScanExecBase` is requested to [groupPartitions](physical-operators/DataSourceV2ScanExecBase.md#groupPartitions)

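As an example, a minimal sketch of enabling the property for a session (the session setup is an assumption, not part of this page):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, for demonstration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("v2-bucketing-demo")
  .config("spark.sql.sources.v2.bucketing.enabled", true)
  .getOrCreate()

// The current value can be read back through the runtime config.
assert(spark.conf.get("spark.sql.sources.v2.bucketing.enabled").toBoolean)
```
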
## spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled { #spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled }

During a Storage-Partitioned Join, whether to allow input partitions to be partially clustered, when both sides of the join are of `KeyGroupedPartitioning`.

Default: `false`

At planning time, Spark will pick the side with less data (based on table statistics), and group and replicate its partitions to match the other side.

This is an optimization on skewed joins and can help reduce data skew when certain partitions are assigned large amounts of data.

Requires both [spark.sql.sources.v2.bucketing.enabled](#spark.sql.sources.v2.bucketing.enabled) and [spark.sql.sources.v2.bucketing.pushPartValues.enabled](#spark.sql.sources.v2.bucketing.pushPartValues.enabled) to be enabled.

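A sketch of the prerequisite wiring (assuming an active `spark` session; property names as documented on this page):

```scala
// Partially clustered distribution only takes effect with both
// V2 bucketing and partition-value pushdown also enabled.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", true)
spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", true)
spark.conf.set("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", true)
```
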
## spark.sql.sources.v2.bucketing.pushPartValues.enabled { #spark.sql.sources.v2.bucketing.pushPartValues.enabled }

Whether to push down common partition values when [spark.sql.sources.v2.bucketing.enabled](#spark.sql.sources.v2.bucketing.enabled) is enabled.

Default: `false`

When enabled, if both sides of a join are of `KeyGroupedPartitioning` and share compatible partition keys (even without exactly the same partition values), Spark will calculate a superset of the partition values and push that information down to the scan nodes, which will use empty partitions for the partition values that are missing on either side.

This could help to eliminate unnecessary shuffles.

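Conceptually (a plain-Scala illustration of the idea, not Spark's actual code):

```scala
// Hypothetical partition key values reported by the two sides of a join.
val leftValues  = Set(1, 2, 3)
val rightValues = Set(2, 3, 4)

// A superset of partition values is pushed down to both scan nodes.
val superset = leftValues union rightValues    // Set(1, 2, 3, 4)

// Values missing on a side become empty partitions (no shuffle needed).
val emptyOnLeft  = superset diff leftValues    // Set(4)
val emptyOnRight = superset diff rightValues   // Set(1)
```
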
## spark.sql.objectHashAggregate.sortBased.fallbackThreshold { #spark.sql.objectHashAggregate.sortBased.fallbackThreshold }

**(internal)** The number of entries in an in-memory hash map (to store aggregation buffers per grouping key) before [ObjectHashAggregateExec](physical-operators/ObjectHashAggregateExec.md) ([ObjectAggregationIterator](aggregations/ObjectAggregationIterator.md#processInputs), precisely) falls back to sort-based aggregation

## keyGroupedPartitioning { #keyGroupedPartitioning }

Optional partitioning [expression](../expressions/Expression.md)s (provided by [connectors](../connector/index.md) using [SupportsReportPartitioning](../connector/SupportsReportPartitioning.md))
??? note "Spark Structured Streaming Not Supported"
48
+
`ContinuousScanExec` and `MicroBatchScanExec` physical operators are not supported (and have `keyGroupedPartitioning` undefined (`None`)).
49
+
47
50
Used when:
48
51
49
52
*`DataSourceV2ScanExecBase` is requested to [groupedPartitions](#groupedPartitions), [groupPartitions](#groupPartitions), [outputPartitioning](#outputPartitioning)
## Output Data Ordering { #outputOrdering }

```scala
outputOrdering: Seq[SortOrder]
```

`outputOrdering` is part of the [QueryPlan](../catalyst/QueryPlan.md#outputOrdering) abstraction.

`outputOrdering`...FIXME

## Simple Node Description { #simpleString }

??? note "TreeNode"

    ```scala
    simpleString(
      maxFields: Int): String
    ```

    `simpleString` is part of the [TreeNode](../catalyst/TreeNode.md#simpleString) abstraction.

`simpleString`...FIXME

## supportsColumnar { #supportsColumnar }

??? note "SparkPlan"

    ```scala
    supportsColumnar: Boolean
    ```

    `supportsColumnar` is part of the [SparkPlan](SparkPlan.md#supportsColumnar) abstraction.

`supportsColumnar` is `true` if the [PartitionReaderFactory](#readerFactory) can [supportColumnarReads](../connector/PartitionReaderFactory.md#supportColumnarReads) for all the [input partitions](#inputPartitions). Otherwise, `supportsColumnar` is `false`.

```text
Cannot mix row-based and columnar input partitions.
```

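The message above comes from a consistency check. A paraphrased sketch (not the exact source) of the logic:

```scala
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}

// Sketch: columnar support requires all-or-none agreement across the
// input partitions' reader-factory capabilities.
def supportsColumnar(
    inputPartitions: Seq[InputPartition],
    readerFactory: PartitionReaderFactory): Boolean = {
  val columnar = inputPartitions.map(readerFactory.supportColumnarReads)
  require(columnar.forall(identity) || !columnar.exists(identity),
    "Cannot mix row-based and columnar input partitions.")
  columnar.exists(identity)
}
```
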
In the end, `verboseStringWithOperatorId` is as follows (based on `formattedNode`...):

```text
Output: [output]
[metaDataStr]
```

## Input Partitions { #partitions }

```scala
partitions: Seq[Seq[InputPartition]]
```

`partitions`...FIXME

---

`partitions` is used when:

* `BatchScanExec` physical operator is requested to [filteredPartitions](BatchScanExec.md#filteredPartitions)
* `ContinuousScanExec` physical operator ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/ContinuousScanExec)) is requested for the `inputRDD`
* `MicroBatchScanExec` physical operator ([Spark Structured Streaming]({{ book.structured_streaming }}/physical-operators/MicroBatchScanExec)) is requested for the `inputRDD`

`groupedPartitions` is a Scala **lazy value** to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the [Scala Language Specification]({{ scala.spec }}/05-classes-and-objects.html#lazy).

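A tiny, Spark-independent illustration of the lazy-val semantics (for the Scala REPL):

```scala
lazy val expensive: Long = {
  println("initializing")  // printed only on the first access
  System.nanoTime()
}

val a = expensive  // prints "initializing" and computes the value
val b = expensive  // reuses the cached value; nothing is printed
assert(a == b)
```
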
`groupedPartitions` takes the [keyGroupedPartitioning](#keyGroupedPartitioning), if specified, and [groups](#groupPartitions) the [input partitions](#inputPartitions).

---

`groupedPartitions` is used when:

* `DataSourceV2ScanExecBase` physical operator is requested for the [output data ordering](#outputOrdering), [output data partitioning requirements](#outputPartitioning), [partitions](#partitions)

`groupPartitions` does nothing (and returns `None`) when called with [spark.sql.sources.v2.bucketing.enabled](../configuration-properties.md#spark.sql.sources.v2.bucketing.enabled) disabled.
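
A paraphrased sketch of that guard (the signature and return type are simplified assumptions, not the exact source):

```scala
import org.apache.spark.sql.connector.read.InputPartition

// `conf` stands for the session's SQLConf available on the operator.
def groupPartitions(
    inputPartitions: Seq[InputPartition]): Option[Seq[Seq[InputPartition]]] = {
  if (!conf.v2BucketingEnabled) None  // spark.sql.sources.v2.bucketing.enabled is off
  else ???  // group the input partitions by partition key values (FIXME in this page)
}
```
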
`groupPartitions`...FIXME

---

`groupPartitions` is used when:

* `BatchScanExec` physical operator is requested for the [filtered input partitions](BatchScanExec.md#filteredPartitions) and [input RDD](BatchScanExec.md#inputRDD)
* `DataSourceV2ScanExecBase` is requested for the [groupedPartitions](#groupedPartitions)