Commit d0b20e7

BloomFilterAggregate and Bloom Filter Join
1 parent 53410bb commit d0b20e7

4 files changed (+122, −10 lines)


docs/aggregations/SortBasedAggregationIterator.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -29,7 +29,7 @@
 initialize(): Unit
 ```
 
-!!! note "Procedure"
+!!! warning "Procedure"
     `initialize` returns `Unit` (_nothing_) and whatever happens inside stays inside (just like in Las Vegas, _doesn't it?!_ 😉)
 
 `initialize`...FIXME
````
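The "Procedure" admonition above (a method that returns `Unit` and only mutates internal state) can be illustrated with a trivial Scala sketch (a hypothetical `Counter` class, not Spark's iterator):

```scala
// A trivial sketch of a "procedure": it returns Unit, so nothing comes back
// to the caller -- the only effect is mutation of internal state.
class Counter {
  private var n = 0
  def initialize(): Unit = { n = 0 }   // whatever happens inside stays inside
  def increment(): Unit = { n += 1 }
  def value: Int = n                   // the state has to be read separately
}
```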

docs/expressions/BloomFilterAggregate.md

Lines changed: 22 additions & 4 deletions
````diff
@@ -20,13 +20,13 @@ title: BloomFilterAggregate
 
 * `InjectRuntimeFilter` logical optimization is requested to [inject a BloomFilter](../logical-optimizations/InjectRuntimeFilter.md#injectBloomFilter)
 
-### <span id="estimatedNumItemsExpression"> Estimated Number of Items Expression
+### Estimated Number of Items Expression { #estimatedNumItemsExpression }
 
 `BloomFilterAggregate` can be given **Estimated Number of Items** (as an [Expression](Expression.md)) when [created](#creating-instance).
 
 Unless given, `BloomFilterAggregate` uses the [spark.sql.optimizer.runtime.bloomFilter.expectedNumItems](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.expectedNumItems) configuration property.
 
-### <span id="numBitsExpression"> Number of Bits Expression
+### Number of Bits Expression { #numBitsExpression }
 
 `BloomFilterAggregate` can be given **Number of Bits** (as an [Expression](Expression.md)) when [created](#creating-instance).
 
@@ -36,7 +36,7 @@ The maximum value for the number of bits is [spark.sql.optimizer.runtime.bloomFi
 
 The number of bits expression is the [third](#third) expression (in this `TernaryLike` tree node).
 
-## <span id="numBits"> Number of Bits
+## Number of Bits { #numBits }
 
 ```scala
 numBits: Long
@@ -76,4 +76,22 @@ The `numBits` value [must be a positive value](#checkInputDataTypes).
 
 `eval` is part of the [TypedImperativeAggregate](TypedImperativeAggregate.md#eval) abstraction.
 
-`eval`...FIXME
+`eval` [serializes](#serialize) the given `buffer` (unless the [cardinality](../bloom-filter-join/BloomFilter.md#cardinality) of this `BloomFilter` is `0`, in which case `eval` returns `null`).
+
+??? note "FIXME Why does `eval` return `null`?"
+
+## Serializing Aggregate Buffer { #serialize }
+
+??? note "TypedImperativeAggregate"
+
+    ```scala
+    serialize(
+      obj: BloomFilter): Array[Byte]
+    ```
+
+    `serialize` is part of the [TypedImperativeAggregate](TypedImperativeAggregate.md#serialize) abstraction.
+
+??? note "Two `serialize`s"
+    There is another `serialize` (in the `BloomFilterAggregate` companion object) that just makes unit testing easier.
+
+`serialize`...FIXME
````
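The `eval`/`serialize` behavior added above (return `null` for an empty filter, the serialized bytes otherwise) can be sketched outside Spark with a toy, hypothetical `SimpleBloomFilter` (a single-hash bit set; this is not Spark's `org.apache.spark.util.sketch.BloomFilter`):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Hypothetical, simplified stand-in for a bloom filter: a fixed-size bit set
// plus a counter of inserted items (the "cardinality").
final class SimpleBloomFilter(numBits: Int) {
  private val bits = new java.util.BitSet(numBits)
  private var inserted = 0L

  def put(item: Long): Unit = {
    // One hash function only, to keep the sketch short.
    bits.set(((item % numBits + numBits) % numBits).toInt)
    inserted += 1
  }

  def cardinality: Long = inserted

  // Mirrors the serialize(obj): Array[Byte] contract: flatten the filter
  // into a byte array suitable for a binary aggregation buffer.
  def serialize: Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    out.writeLong(inserted)
    out.write(bits.toByteArray)
    out.flush()
    bos.toByteArray
  }
}

// Mirrors the eval behavior: null for an empty filter, bytes otherwise.
def eval(buffer: SimpleBloomFilter): Array[Byte] =
  if (buffer.cardinality == 0) null else buffer.serialize
```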

docs/expressions/TypedImperativeAggregate.md

Lines changed: 26 additions & 4 deletions
````diff
@@ -74,7 +74,7 @@ Used when:
 
 * `TypedImperativeAggregate` is requested to [merge](#merge-Expression) and [mergeBuffersObjects](#mergeBuffersObjects)
 
-### serialize { #serialize }
+### Serializing Aggregate Buffer { #serialize }
 
 ```scala
 serialize(
@@ -88,7 +88,7 @@ See:
 
 Used when:
 
-* `TypedImperativeAggregate` is requested to [serializeAggregateBufferInPlace](#serializeAggregateBufferInPlace)
+* `TypedImperativeAggregate` is requested to [serialize the aggregate buffer in-place](#serializeAggregateBufferInPlace)
 
 ### update { #update }
 
@@ -126,7 +126,7 @@ Used when:
 
 `eval` [takes the buffer object](#getBufferObject) out of the given [InternalRow](../InternalRow.md) and [evaluates the result](#eval).
 
-## Aggregation Buffer Attributes { #aggBufferAttributes }
+## Aggregate Buffer Attributes { #aggBufferAttributes }
 
 ??? note "AggregateFunction"
 
@@ -142,7 +142,7 @@ Used when:
 -------|---------
 `buf` | [BinaryType](../types/DataType.md#BinaryType)
 
-## Accessing Buffer Object { #getBufferObject }
+## Extracting Aggregate Buffer Object { #getBufferObject }
 
 ```scala
 getBufferObject(
@@ -171,3 +171,25 @@ anyObjectType: ObjectType
 When created, `TypedImperativeAggregate` creates an `ObjectType` of a value of Scala `AnyRef` type.
 
 The `ObjectType` is used in [getBufferObject](#getBufferObject).
+
+## Serializing Aggregate Buffer In-Place { #serializeAggregateBufferInPlace }
+
+```scala
+serializeAggregateBufferInPlace(
+  buffer: InternalRow): Unit
+```
+
+??? warning "Procedure"
+    `serializeAggregateBufferInPlace` is a procedure (returns `Unit`) so _whatever happens inside, stays inside_ (paraphrasing the [former advertising slogan of Las Vegas, Nevada](https://idioms.thefreedictionary.com/what+happens+in+Vegas+stays+in+Vegas)).
+
+`serializeAggregateBufferInPlace` [gets the aggregate buffer](#getBufferObject) from the given `buffer` and [serializes it](#serialize).
+
+In the end, `serializeAggregateBufferInPlace` stores the serialized aggregate buffer back in the given `buffer` at [mutableAggBufferOffset](ImperativeAggregate.md#mutableAggBufferOffset).
+
+---
+
+`serializeAggregateBufferInPlace` is used when:
+
+* `AggregatingAccumulator` is requested to [withBufferSerialized](../AggregatingAccumulator.md#withBufferSerialized)
+* `AggregationIterator` is requested to [generateResultProjection](../aggregations/AggregationIterator.md#generateResultProjection)
+* `ObjectAggregationMap` is requested to [dumpToExternalSorter](../aggregations/ObjectAggregationMap.md#dumpToExternalSorter)
````
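The in-place step documented for `serializeAggregateBufferInPlace` (read the live buffer object out of a row slot, serialize it, write the bytes back into the same slot) can be sketched with plain Scala types. All names here are hypothetical, and an `Array[Any]` stands in for `InternalRow`:

```scala
import java.nio.charset.StandardCharsets

// Toy aggregation state (hypothetical; the real buffer object is
// aggregate-specific, e.g. a BloomFilter).
final case class AggState(value: String)

class ToyTypedImperativeAggregate(mutableAggBufferOffset: Int) {
  // getBufferObject analogue: read the live aggregation state from the slot.
  private def getBufferObject(buffer: Array[Any]): AggState =
    buffer(mutableAggBufferOffset).asInstanceOf[AggState]

  // serialize analogue: flatten the state to bytes (BinaryType in Spark).
  private def serialize(obj: AggState): Array[Byte] =
    obj.value.getBytes(StandardCharsets.UTF_8)

  // The in-place step: replace the object in the slot with its bytes.
  def serializeAggregateBufferInPlace(buffer: Array[Any]): Unit =
    buffer(mutableAggBufferOffset) = serialize(getBufferObject(buffer))
}
```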

docs/logical-optimizations/InjectRuntimeFilter.md

Lines changed: 73 additions & 1 deletion
````diff
@@ -4,6 +4,9 @@
 
 `InjectRuntimeFilter` is part of the [InjectRuntimeFilter](../SparkOptimizer.md#InjectRuntimeFilter) fixed-point batch of rules.
 
+!!! note "Runtime Filter"
+    A **Runtime Filter** can be a [BloomFilter](#hasBloomFilter) (with [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled) enabled) or an [InSubquery](#hasInSubquery) filter.
+
 !!! note "Noop"
     `InjectRuntimeFilter` is a _noop_ (and does nothing) for the following cases:
 
@@ -107,7 +110,30 @@ injectInSubqueryFilter(
   filterCreationSidePlan: LogicalPlan): LogicalPlan
 ```
 
-`injectInSubqueryFilter`...FIXME
+!!! note "The same `DataType`s"
+    `injectInSubqueryFilter` requires that the [DataType](../expressions/Expression.md#dataType)s of the given `filterApplicationSideExp` and `filterCreationSideExp` are the same.
+
+`injectInSubqueryFilter` creates an [Aggregate](../logical-operators/Aggregate.md) logical operator with the following:
+
+Property | Value
+---------|------
+[Grouping Expressions](../logical-operators/Aggregate.md#groupingExpressions) | The given `filterCreationSideExp` expression
+[Aggregate Expressions](../logical-operators/Aggregate.md#aggregateExpressions) | An `Alias` expression for the `filterCreationSideExp` expression (possibly [wrapped in a hash](#mayWrapWithHash))
+[Child Logical Operator](../logical-operators/Aggregate.md#child) | The given `filterCreationSidePlan` logical plan
+
+`injectInSubqueryFilter` executes the [ColumnPruning](../logical-optimizations/ColumnPruning.md) logical optimization on the `Aggregate` logical operator.
+
+Unless the `Aggregate` logical operator [can be broadcast by size](../JoinSelectionHelper.md#canBroadcastBySize), `injectInSubqueryFilter` returns the given `filterApplicationSidePlan` logical plan (and effectively throws away all the work so far).
+
+!!! note
+    `injectInSubqueryFilter` skips the `InSubquery` filter when the size of the `Aggregate` is beyond the [broadcast join threshold](../JoinSelectionHelper.md#canBroadcastBySize) and the semi-join would be a shuffle join, which is not worthwhile.
+
+`injectInSubqueryFilter` creates an `InSubquery` expression with the following:
+
+* The given `filterApplicationSideExp` (possibly [wrapped in a hash](#mayWrapWithHash))
+* A [ListQuery](../expressions/ListQuery.md) expression with the `Aggregate`
+
+In the end, `injectInSubqueryFilter` creates a `Filter` logical operator with the `InSubquery` expression (as the condition) and the given `filterApplicationSidePlan` logical plan (as the child).
 
 !!! note
     `injectInSubqueryFilter` is used when `InjectRuntimeFilter` is requested to [injectFilter](#injectFilter) with the [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled) configuration property disabled and [spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled](../configuration-properties.md#spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled) enabled.
````
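At a high level, the documented `injectInSubqueryFilter` steps boil down to: collect the distinct join keys of the creation side, give up if that set is too big to broadcast, and otherwise filter the application side with an in-subquery membership test. A collection-level Scala sketch (hypothetical names and a size threshold; Spark does this with logical operators, not `Seq`s):

```scala
// Simulates the in-subquery runtime filter on plain collections.
def injectInSubqueryFilter[K](
    applicationSide: Seq[(K, String)],  // rows keyed by the join key
    creationSide: Seq[K],               // join keys on the creation side
    broadcastThreshold: Int): Seq[(K, String)] = {
  // "Aggregate": distinct join keys of the filter creation side.
  val keys = creationSide.distinct.toSet
  if (keys.size > broadcastThreshold)
    applicationSide                     // too big to broadcast: skip the filter
  else
    applicationSide.filter { case (k, _) => keys(k) }  // Filter(InSubquery(...))
}
```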
````diff
@@ -128,3 +154,49 @@ isSimpleExpression(
 * `LIKE_FAMLIY`
 * `REGEXP_EXTRACT_FAMILY`
 * `REGEXP_REPLACE`
+
+## hasDynamicPruningSubquery { #hasDynamicPruningSubquery }
+
+```scala
+hasDynamicPruningSubquery(
+  left: LogicalPlan,
+  right: LogicalPlan,
+  leftKey: Expression,
+  rightKey: Expression): Boolean
+```
+
+`hasDynamicPruningSubquery` checks whether there is a `Filter` logical operator with a [DynamicPruningSubquery](../expressions/DynamicPruningSubquery.md) expression on the `left` or `right` side (of a join).
+
+## hasRuntimeFilter { #hasRuntimeFilter }
+
+```scala
+hasRuntimeFilter(
+  left: LogicalPlan,
+  right: LogicalPlan,
+  leftKey: Expression,
+  rightKey: Expression): Boolean
+```
+
+`hasRuntimeFilter` checks whether there is a [bloom filter](#hasBloomFilter) (with [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled) enabled) or an [in-subquery](#hasInSubquery) filter on the `left` or `right` side (of a join).
+
+## hasBloomFilter { #hasBloomFilter }
+
+```scala
+hasBloomFilter(
+  left: LogicalPlan,
+  right: LogicalPlan,
+  leftKey: Expression,
+  rightKey: Expression): Boolean
+```
+
+`hasBloomFilter` checks whether [a bloom filter can be found](#findBloomFilterWithExp) on the `left` or `right` side (of a join).
+
+## findBloomFilterWithExp { #findBloomFilterWithExp }
+
+```scala
+findBloomFilterWithExp(
+  plan: LogicalPlan,
+  key: Expression): Boolean
+```
+
+`findBloomFilterWithExp` tries to find a `Filter` logical operator with a [BloomFilterMightContain](../expressions/BloomFilterMightContain.md) expression (possibly over an `XxHash64` hash of the given `key`) among the nodes of the given [LogicalPlan](../logical-operators/LogicalPlan.md).
````
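The tree search that `findBloomFilterWithExp` performs (walk a plan, look for a `Filter` whose condition contains a bloom-filter membership expression) can be sketched with a toy plan/expression ADT. All types here are hypothetical stand-ins for Spark's `LogicalPlan`/`Expression` trees:

```scala
// Toy plan and expression trees (hypothetical ADT, not Spark's).
sealed trait Plan { def children: Seq[Plan] }
final case class Scan(name: String) extends Plan { def children = Nil }
final case class FilterNode(condition: Expr, child: Plan) extends Plan {
  def children = Seq(child)
}

sealed trait Expr { def children: Seq[Expr] }
final case class MightContain(key: String) extends Expr { def children = Nil }
final case class And(l: Expr, r: Expr) extends Expr { def children = Seq(l, r) }
final case class Equals(a: String, b: String) extends Expr { def children = Nil }

// Does any sub-expression satisfy the predicate?
def existsExpr(e: Expr)(p: Expr => Boolean): Boolean =
  p(e) || e.children.exists(existsExpr(_)(p))

// findBloomFilterWithExp analogue: is there a Filter whose condition
// contains a MightContain expression anywhere in the plan?
def findBloomFilter(plan: Plan): Boolean = plan match {
  case FilterNode(cond, child) =>
    existsExpr(cond) { case MightContain(_) => true; case _ => false } ||
      findBloomFilter(child)
  case other => other.children.exists(findBloomFilter)
}
```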
