
feat(spark): build and publish multi-variant targets#707

Merged
andrew-coleman merged 3 commits into substrait-io:main from andrew-coleman:spark4
Mar 4, 2026

Conversation


@andrew-coleman andrew-coleman commented Feb 20, 2026

This PR adds the framework to build and publish multiple Substrait Spark packages targeted at different versions of Scala and Spark.

This is a large PR, but it makes no functional changes. It is split into three commits:

  1. Upgrade the Spark dependency to version 4.0.2.
    • A few breaking API changes had to be resolved.
    • Two new functions, lpad and rpad, had to be added because the Spark 4.0.2 query planner/optimizer inserted them for many of the TPC-DS queries. The dialect.yaml file was regenerated accordingly.
  2. Implement multi-variant build and publish.
    • Supporting the version matrix:
      • Scala 2.12 and 2.13.
      • Spark 3.4, 3.5 and 4.0.
    • Added build targets:
      • Spark 3.4 with Scala 2.12
      • Spark 3.5 with Scala 2.12
      • Spark 4.0 with Scala 2.13
    • Extra targets can be added later; a README describing how to do so has been included.
  3. Implement the code for each of these target versions.
    • Since there were breaking API changes between v3.x and v4.x, this commit refactors the code to abstract the affected API calls into a compatibility interface that is implemented for each version.
    • The dialect generator was updated to order the functions alphabetically by name. This was necessary because the order previously depended on how Scala iterates over the contents of a map, which differs between the two Scala versions. The newly generated dialect file in this commit is simply a reordered version of the previous one.
    • The Gradle script was updated to support publication of the three targets:
      • spark34_2.12, spark35_2.12 and spark40_2.13
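
The compatibility interface described in commit 3 might look roughly like the sketch below. Only the names SparkCompat, instance and supportsWindowGroupLimit appear in the PR's quoted diff; the object layout, the hard-coded instance and the PlanHandling caller are illustrative assumptions, not the actual implementation.

```scala
// Sketch of the compatibility-interface pattern: Spark API calls that changed
// between 3.x and 4.x are hidden behind a trait, and each build variant's
// source set supplies its own implementation.
trait SparkCompat {
  // True when the targeted Spark version provides the WindowGroupLimit plan node.
  def supportsWindowGroupLimit: Boolean
}

object SparkCompat {
  // In the real build, each target's source set would provide this value;
  // it is hard-coded here purely for illustration.
  val instance: SparkCompat = new SparkCompat {
    override def supportsWindowGroupLimit: Boolean = true
  }
}

// Version-neutral code then branches through the interface rather than
// referencing version-specific Spark classes directly.
object PlanHandling {
  def strategy(): String =
    if (SparkCompat.instance.supportsWindowGroupLimit) "handle WindowGroupLimit"
    else "skip WindowGroupLimit"
}
```

Because the implementation is selected at build time through source sets rather than at run time, each published artifact presumably contains only the code matching its Spark version.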

@andrew-coleman andrew-coleman marked this pull request as draft February 20, 2026 14:49
@andrew-coleman andrew-coleman force-pushed the spark4 branch 9 times, most recently from 06be1c4 to 3f137e8 on February 27, 2026 13:41
@andrew-coleman andrew-coleman marked this pull request as ready for review February 27, 2026 14:24
@andrew-coleman andrew-coleman force-pushed the spark4 branch 2 times, most recently from a7c0931 to ecdc2e6 on March 3, 2026 10:36
WIP

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
Adds support for building/publishing multiple targets for
different versions of Scala (2.12, 2.13) and Spark (3.4, 3.5, 4.0).
Other combinations can be added in the future.

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>

# Conflicts:
#	gradle/libs.versions.toml
There are a number of Spark API breaking changes between
3.x and 4.x.
This commit refactors the code to abstract affected API calls
into a compatibility interface which can be implemented by each version.

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>

@bestbeforetoday bestbeforetoday left a comment


A few minor things that can be tidied up, either in this PR or in a later PR (see comments). Generally looks OK to me though so approved for you to merge when you are ready.

Comment on lines +113 to +125
// Scala dependencies
implementation("org.scala-lang:scala-library:$scalaVersion")
testImplementation("org.scalatest:scalatest_$scalaBinary:3.2.19")
testRuntimeOnly("org.scalatestplus:junit-5-12_$scalaBinary:3.2.19.0")

// Spark dependencies
api("org.apache.spark:spark-core_$scalaBinary:$sparkVersion")
api("org.apache.spark:spark-sql_$scalaBinary:$sparkVersion")
implementation("org.apache.spark:spark-hive_$scalaBinary:$sparkVersion")
implementation("org.apache.spark:spark-catalyst_$scalaBinary:$sparkVersion")
testImplementation("org.apache.spark:spark-core_$scalaBinary:$sparkVersion:tests")
testImplementation("org.apache.spark:spark-sql_$scalaBinary:$sparkVersion:tests")
testImplementation("org.apache.spark:spark-catalyst_$scalaBinary:$sparkVersion:tests")

Probably should be using values from libs here but that can be tidied up later.

Comment on lines +13 to +16
val sparkVersion = "3.4.4"
val scalaVersion = "2.12.20"
val sparkMajorMinor = "3.4"
val scalaBinary = "2.12"

It would be nice to pull these values out of libs to keep everything consistent and the versions defined in a single place. This can be tidied up later though.

Comment on lines +19 to +21
"spark34_2.12" to Triple(":spark:spark-3.4_2.12", "3.4.4", "2.12"),
"spark35_2.12" to Triple(":spark:spark-3.5_2.12", "3.5.4", "2.12"),
"spark40_2.13" to Triple(":spark:spark-4.0_2.13", "4.0.2", "2.13"),

It would be nice for these versions to be centralised in libs. That can be tidied up later.

Comment on lines +87 to +88
// case plan if SparkCompat.instance.supportsWindowGroupLimit =>
// SparkCompat.instance.handleWindowGroupLimit(plan, visit)

Probably should remove commented out code.

Comment on lines +105 to +117
### Publish to Maven Central Portal

Publish all variants to Maven Central:

```bash
./gradlew :spark:publishAllVariantsToCentralPortal
```

Or publish a specific variant:

```bash
./gradlew :spark:spark-4.0_2.13:publishMaven-publishPublicationToNmcpRepository
```

I am not sure this section is correct. It can be reviewed and corrected (or removed) later.

@andrew-coleman
Member Author

Thanks Mark. In the interest of getting this delivered, I'll merge now and address your comments in a followup PR.

@andrew-coleman andrew-coleman merged commit d3ce994 into substrait-io:main Mar 4, 2026
12 checks passed
@andrew-coleman andrew-coleman deleted the spark4 branch March 4, 2026 07:03