Codegen in Spark 2.0

Launch spark-shell for Spark v2.2.

$ bin/spark-shell --conf spark.sql.codegen.comments=true
import org.apache.spark.sql.execution.debug._
spark.conf.set("spark.sql.codegen.wholeStage","true")
val loadCsv = spark.read.option("header","true").option("inferSchema","true").csv(_:String)
val basePath = "/data/ml-latest"
val movies = loadCsv(s"$basePath/movies.csv")
val ratings = loadCsv(s"$basePath/ratings.csv")
val moviesAgg = ratings.alias("t1")
.join(movies.alias("t2"), col("t1.movieId") === col("t2.movieId"))
.groupBy("t1.movieId","title")
.agg(avg("rating").alias("rating_avg"), count("*").alias("rating_count"))
.filter("rating_count >=100")
.orderBy(desc("rating_avg"))
.limit(10)

Explain plan:

moviesAgg.explain(true)

In the explain output, when an operator has a star around it (*), whole-stage code generation is enabled.

View the code gen output.

moviesAgg.debugCodegen

Effect of codegen

Because of the codgen, the dataset DSL or Spark SQL runs a lot faster.

spark.conf.set("spark.sql.codegen.wholeStage", true)
benchmark("Spark 2.0") {
  spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
}

Time taken in Spark 2.0: 0.584037925 seconds

Same code with codegen disabled,

Time taken in Spark 1.6: 13.113527926 seconds

Example notebook;

https://www.zepl.com/spaces/S_ZEPL/e6c838b40e2e406db5d19eeddeeffe36