Commit 53410bb
BloomFilter, BloomFilterImpl and Bloom Filter Join
1 parent 7187d27

8 files changed: +77 additions, -26 deletions

docs/Dataset.md (4 additions, 0 deletions)

````diff
@@ -1,3 +1,7 @@
+---
+title: Dataset
+---
+
 # Dataset
 
 `Dataset[T]` is a strongly-typed data structure that represents a structured query over rows of `T` type.
````

docs/bloom-filter-join/BloomFilter.md (30 additions, 10 deletions)

````diff
@@ -3,45 +3,65 @@
 `BloomFilter` is an [abstraction](#contract) of [bloom filters](#implementations) for the following:
 
 * [DataFrameStatFunctions.bloomFilter](../DataFrameStatFunctions.md#bloomFilter) operator
-* As an [aggregation buffer](../expressions/BloomFilterAggregate.md#createAggregationBuffer) in [BloomFilterAggregate](../expressions/BloomFilterAggregate.md) expression
+* [BloomFilterAggregate](../expressions/BloomFilterAggregate.md) expression (as an [aggregation buffer](../expressions/BloomFilterAggregate.md#createAggregationBuffer))
 * [BloomFilterMightContain](../expressions/BloomFilterMightContain.md#bloomFilter) expression
 
 ## Contract (Subset)
 
-### <span id="mightContain"> mightContain
+### bitSize { #bitSize }
+
+```java
+long bitSize()
+```
+
+See:
+
+* [BloomFilterImpl](BloomFilterImpl.md#bitSize)
+
+Used when:
+
+* `BloomFilterAggregate` is requested to [serialize a BloomFilter](../expressions/BloomFilterAggregate.md#serialize)
+
+### mightContain { #mightContain }
 
 ```java
 boolean mightContain(
   Object item)
 ```
 
-See [BloomFilterImpl](BloomFilterImpl.md#mightContain)
+See:
+
+* [BloomFilterImpl](BloomFilterImpl.md#mightContain)
 
 !!! note "Not Used"
     `mightContain` does not seem to be used (as [mightContainLong](#mightContainLong) seems to be used directly instead).
 
-### <span id="mightContainLong"> mightContainLong
+### mightContainLong { #mightContainLong }
 
 ```java
 boolean mightContainLong(
   long item)
 ```
 
-See [BloomFilterImpl](BloomFilterImpl.md#mightContainLong)
+See:
+
+* [BloomFilterImpl](BloomFilterImpl.md#mightContainLong)
 
 Used when:
 
 * `BloomFilterImpl` is requested to [mightContain](BloomFilterImpl.md#mightContain)
-* `BloomFilterMightContain` is requested to [eval](../expressions/BloomFilterMightContain.md#eval) and [doGenCode](../expressions/BloomFilterMightContain.md#doGenCode)
+* `BloomFilterMightContain` is requested to [evaluate](../expressions/BloomFilterMightContain.md#eval) and [doGenCode](../expressions/BloomFilterMightContain.md#doGenCode)
 
-### <span id="mightContainString"> mightContainString
+### mightContainString { #mightContainString }
 
 ```java
 boolean mightContainString(
   String item)
 ```
 
-See [BloomFilterImpl](BloomFilterImpl.md#mightContainString)
+See:
+
+* [BloomFilterImpl](BloomFilterImpl.md#mightContainString)
 
 Used when:
 
@@ -51,7 +71,7 @@ Used when:
 
 * [BloomFilterImpl](BloomFilterImpl.md)
 
-## <span id="create"> Creating BloomFilter
+## Creating BloomFilter { #create }
 
 ```java
 BloomFilter create(
@@ -66,7 +86,7 @@
 
 `create` creates a [BloomFilterImpl](BloomFilterImpl.md) for the given `expectedNumItems`.
 
-Unless the false positive probability is given, `create` uses [DEFAULT_FPP](#DEFAULT_FPP) value to [determine the number of bits](#optimalNumOfBits).
+Unless the **False Positive Probability** (`fpp`) is given, `create` uses the [DEFAULT_FPP](#DEFAULT_FPP) value to [determine the optimal number of bits](#optimalNumOfBits).
 
 ---
 
````
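Sizing a bloom filter from `expectedNumItems` and a false positive probability follows the textbook formula `m = -n * ln(p) / (ln 2)^2`. The sketch below is a hedged illustration of that math, not Spark's actual code; the method name `optimalNumOfBits` mirrors the anchor referenced above, and the `0.03` default FPP is an assumption made here for the example.

```java
// Sketch of the standard bloom filter sizing formula referenced above.
// Not Spark's implementation; DEFAULT_FPP = 0.03 is an assumption.
public class BloomSizing {
    static final double DEFAULT_FPP = 0.03;

    // m = -n * ln(p) / (ln 2)^2, the optimal bit count for n items at FPP p
    public static long optimalNumOfBits(long expectedNumItems, double fpp) {
        return (long) (-expectedNumItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        // roughly 7.3 bits per item at a 3% false positive probability
        System.out.println(optimalNumOfBits(1_000_000L, DEFAULT_FPP));
    }
}
```

Note how the bit count depends only logarithmically on the FPP: halving `fpp` adds about one extra bit per item.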

docs/bloom-filter-join/BloomFilterImpl.md (13 additions, 8 deletions)

````diff
@@ -13,15 +13,20 @@
 
 * `BloomFilter` is requested to [create a BloomFilter](BloomFilter.md#create)
 
-## <span id="mightContainLong"> mightContainLong
+## mightContainLong { #mightContainLong }
 
-```java
-boolean mightContainLong(
-  long item)
-```
+??? note "BloomFilter"
 
-`mightContainLong` is part of the [BloomFilter](BloomFilter.md#mightContainLong) abstraction.
+    ```java
+    boolean mightContainLong(
+      long item)
+    ```
 
----
+    `mightContainLong` is part of the [BloomFilter](BloomFilter.md#mightContainLong) abstraction.
 
-`mightContainLong`...FIXME
+`mightContainLong` uses `Murmur3_x86_32` to generate two hashes of the given `item` with two different seeds: `0` and the result of the first hashing.
+
+`mightContainLong` requests the [BitArray](#bits) for the number of bits (`bitSize`).
+
+In the end, `mightContainLong` checks whether the bit for the combined hashes is set in the [BitArray](#bits), repeating the check up to [numHashFunctions](#numHashFunctions) times.
+With all the checked bits set, `mightContainLong` is positive (`true`). Otherwise, `mightContainLong` is negative (`false`).
````
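The double-hashing scheme described in the added lines can be sketched as follows. This is a conceptual illustration, not Spark's `BloomFilterImpl`: the `hashLong` mixer below is a stand-in for `Murmur3_x86_32`, and `java.util.BitSet` stands in for the `BitArray`.

```java
import java.util.BitSet;

// Conceptual sketch of the double-hashing membership check described above.
// hashLong is a stand-in for Murmur3_x86_32; BitSet stands in for BitArray.
public class SketchBloom {
    private final BitSet bits;
    private final long bitSize;
    private final int numHashFunctions;

    public SketchBloom(long bitSize, int numHashFunctions) {
        this.bitSize = bitSize;
        this.numHashFunctions = numHashFunctions;
        this.bits = new BitSet((int) bitSize);
    }

    // Stand-in for Murmur3_x86_32.hashLong(item, seed): a simple 64-bit mix
    private static int hashLong(long item, int seed) {
        long h = item * 0x9E3779B97F4A7C15L + seed;
        h ^= (h >>> 32);
        return (int) h;
    }

    public void putLong(long item) {
        int h1 = hashLong(item, 0);   // first hash: seed 0
        int h2 = hashLong(item, h1);  // second hash: seeded with the first result
        for (int i = 1; i <= numHashFunctions; i++) {
            int combined = h1 + i * h2;
            if (combined < 0) combined = ~combined; // flip to a non-negative index
            bits.set((int) (combined % bitSize));
        }
    }

    public boolean mightContainLong(long item) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        for (int i = 1; i <= numHashFunctions; i++) {
            int combined = h1 + i * h2;
            if (combined < 0) combined = ~combined;
            if (!bits.get((int) (combined % bitSize))) {
                return false; // an unset bit means the item was never added
            }
        }
        return true; // all checked bits set: possibly present
    }

    public static void main(String[] args) {
        SketchBloom bf = new SketchBloom(1 << 20, 5);
        bf.putLong(42L);
        System.out.println(bf.mightContainLong(42L)); // true: no false negatives
    }
}
```

Deriving every probe index from just two base hashes is the classic double-hashing trick: it keeps the per-item cost at two hash computations regardless of `numHashFunctions`.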

docs/bloom-filter-join/index.md (5 additions, 3 deletions)

````diff
@@ -1,12 +1,14 @@
 # Bloom Filter Join
 
-**Bloom Filter Join** is an optimization of join queries by pre-filtering one side of a join using a Bloom filter and IN predicate based on the values from the other side of the join.
+**Bloom Filter Join** is an optimization of join queries that pre-filters one side of a join (using a [BloomFilter](BloomFilter.md) or an `InSubquery` predicate) based on the values from the other side of the join.
+
+Bloom Filter Join uses [BloomFilter](BloomFilter.md)s as runtime filters when the [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled) configuration property is enabled.
+
+Bloom Filter Join uses the [InjectRuntimeFilter](../logical-optimizations/InjectRuntimeFilter.md) logical optimization to inject up to [spark.sql.optimizer.runtimeFilter.number.threshold](../configuration-properties.md#spark.sql.optimizer.runtimeFilter.number.threshold) filters ([BloomFilter](BloomFilter.md)s or `InSubquery`s).
 
 ??? note "SPARK-32268"
     Bloom Filter Join was introduced in [SPARK-32268]({{ spark.jira }}/SPARK-32268).
 
-Bloom Filter Join uses [InjectRuntimeFilter](../logical-optimizations/InjectRuntimeFilter.md) logical optimization to...FIXME
-
 ## Configuration Properties
 
 * [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled)
````
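Conceptually, a runtime filter turns a join into "collect the build-side keys first, then prune the probe side before the join runs". A toy sketch (hypothetical names, not Spark API): an exact `HashSet` stands in for the runtime filter, which is essentially what the `InSubquery` flavor amounts to, while the `BloomFilter` flavor trades exactness for a fixed-size bitmap.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of runtime-filter pruning (hypothetical names, not Spark API):
// collect the join keys of the filter-creation side, then drop probe-side rows
// whose key cannot possibly match, before the join executes.
public class RuntimeFilterSketch {
    public static List<long[]> preFilter(List<long[]> probeRows, List<Long> buildKeys) {
        Set<Long> filter = new HashSet<>(buildKeys); // stand-in for the runtime filter
        List<long[]> kept = new ArrayList<>();
        for (long[] row : probeRows) {
            if (filter.contains(row[0])) { // row[0] is the join key
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<long[]> probe = List.of(new long[]{1, 10}, new long[]{2, 20}, new long[]{3, 30});
        System.out.println(preFilter(probe, List.of(1L, 3L)).size()); // prints 2
    }
}
```

Pruning pays off when the probe side is large and the filter is selective, which is why the optimization is guarded by benefit checks rather than applied to every join.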

docs/logical-optimizations/InjectRuntimeFilter.md (13 additions, 5 deletions)

````diff
@@ -25,18 +25,24 @@
 
 With [runtimeFilterSemiJoinReductionEnabled](../SQLConf.md#runtimeFilterSemiJoinReductionEnabled) enabled and the new and the initial logical plans not equal, `apply` executes [RewritePredicateSubquery](RewritePredicateSubquery.md) logical optimization with the new logical plan. Otherwise, `apply` returns the new logical plan.
 
-## <span id="tryInjectRuntimeFilter"> tryInjectRuntimeFilter
+## tryInjectRuntimeFilter { #tryInjectRuntimeFilter }
 
 ```scala
 tryInjectRuntimeFilter(
   plan: LogicalPlan): LogicalPlan
 ```
 
-`tryInjectRuntimeFilter` [finds equi-joins](../ExtractEquiJoinKeys.md#unapply) in the given [LogicalPlan](../logical-operators/LogicalPlan.md).
+`tryInjectRuntimeFilter` transforms the given [LogicalPlan](../logical-operators/LogicalPlan.md) with regard to [equi-joins](../ExtractEquiJoinKeys.md#unapply).
 
-When _some_ requirements are met, `tryInjectRuntimeFilter` [injectFilter](#injectFilter) on the left side first and on the right side if on the left was not successful.
+For every equi-join, `tryInjectRuntimeFilter` [injects a runtime filter](#injectFilter) (on the left side first, and on the right side if the left side was not successful) when all the following requirements are met:
 
-`tryInjectRuntimeFilter` uses [spark.sql.optimizer.runtimeFilter.number.threshold](../configuration-properties.md#spark.sql.optimizer.runtimeFilter.number.threshold) configuration property.
+1. A join side has no [DynamicPruningSubquery](#hasDynamicPruningSubquery) filter already
+1. A join side has no [RuntimeFilter](#hasRuntimeFilter)
+1. The left and right keys (pair-wise) are [simple expression](#isSimpleExpression)s
+1. [canPruneLeft](../JoinSelectionHelper.md#canPruneLeft) or [canPruneRight](../JoinSelectionHelper.md#canPruneRight)
+1. [filteringHasBenefit](#filteringHasBenefit)
+
+`tryInjectRuntimeFilter` tries to inject up to [spark.sql.optimizer.runtimeFilter.number.threshold](../configuration-properties.md#spark.sql.optimizer.runtimeFilter.number.threshold) filters.
 
 ## Injecting Filter Operator { #injectFilter }
 
@@ -48,7 +54,9 @@
   filterCreationSidePlan: LogicalPlan): LogicalPlan
 ```
 
-`injectFilter`...FIXME
+With [spark.sql.optimizer.runtime.bloomFilter.enabled](../configuration-properties.md#spark.sql.optimizer.runtime.bloomFilter.enabled) enabled, `injectFilter` [injects a filter using a BloomFilter](#injectBloomFilter).
+
+Otherwise, `injectFilter` [injects a filter using InSubquery](#injectInSubqueryFilter).
 
 ### Injecting BloomFilter { #injectBloomFilter }
 
````

docs/spark-sql-DataFrameNaFunctions.md (4 additions, 0 deletions)

````diff
@@ -1,3 +1,7 @@
+---
+title: DataFrameNaFunctions
+---
+
 # DataFrameNaFunctions &mdash; Working With Missing Data
 
 `DataFrameNaFunctions` is used to work with <<methods, missing data>> in a structured query (a [DataFrame](DataFrame.md)).
````

docs/spark-sql-Dataset-basic-actions.md (4 additions, 0 deletions)

````diff
@@ -1,3 +1,7 @@
+---
+title: Basic Actions
+---
+
 # Dataset API &mdash; Basic Actions
 
 **Basic actions** are a set of operators (_methods_) of the <<spark-sql-dataset-operators.md#, Dataset API>> for transforming a `Dataset` into a session-scoped or global temporary view and _other basic actions_ (FIXME).
````

docs/spark-sql-dataset-operators.md (4 additions, 0 deletions)

````diff
@@ -1,3 +1,7 @@
+---
+title: Operators
+---
+
 # Dataset API &mdash; Dataset Operators
 
 Dataset API is a [set of operators](#methods) with typed and untyped transformations, and actions to work with a structured query (as a [Dataset](Dataset.md)) as a whole.
````
