Codegen in Spark 2.0

Launch spark-shell for Spark v2.2.

$ bin/spark-shell --conf spark.sql.codegen.comments=true

import org.apache.spark.sql.execution.debug._

spark.conf.set("spark.sql.codegen.wholeStage","true")

val loadCsv = spark.read.option("header","true").option("inferSchema","true").csv(_:String)

val basePath = "/data/ml-latest"

val movies = loadCsv(s"$basePath/movies.csv")

val ratings = loadCsv(s"$basePath/ratings.csv")

val moviesAgg = ratings.alias("t1")

.join(movies.alias("t2"), col("t1.movieId") === col("t2.movieId"))

.groupBy("t1.movieId","title")

.agg(avg("rating").alias("rating_avg"), count("*").alias("rating_count"))

.filter("rating_count >=100")

.orderBy(desc("rating_avg"))

.limit(10)

Explain plan:

moviesAgg.explain(true)

In the explain output, when an operator has a star around it (*), whole-stage code generation is enabled.

View the code gen output.

moviesAgg.debugCodegen

Effect of codegen

Because of the codgen, the dataset DSL or Spark SQL runs a lot faster.

spark.conf.set("spark.sql.codegen.wholeStage", true)

benchmark("Spark 2.0") {

spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()

}

Time taken in Spark 2.0: 0.584037925 seconds

Same code with codegen disabled,

Time taken in Spark 1.6: 13.113527926 seconds

Example notebook;

https://www.zepl.com/spaces/S_ZEPL/e6c838b40e2e406db5d19eeddeeffe36