Codegen in Spark 2.0

Launch spark-shell for Spark v2.2.

$ bin/spark-shell --conf spark.sql.codegen.comments=true
import org.apache.spark.sql.execution.debug._
val loadCsv ="header","true").option("inferSchema","true").csv(_:String)
val basePath = "/data/ml-latest"
val movies = loadCsv(s"$basePath/movies.csv")
val ratings = loadCsv(s"$basePath/ratings.csv")
val moviesAgg = ratings.alias("t1")
.join(movies.alias("t2"), col("t1.movieId") === col("t2.movieId"))
.agg(avg("rating").alias("rating_avg"), count("*").alias("rating_count"))
.filter("rating_count >=100")

Explain plan:


In the explain output, when an operator has a star around it (*), whole-stage code generation is enabled.

View the code gen output.


Effect of codegen

Because of the codgen, the dataset DSL or Spark SQL runs a lot faster.

spark.conf.set("spark.sql.codegen.wholeStage", true)
benchmark("Spark 2.0") {
  spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()

Time taken in Spark 2.0: 0.584037925 seconds

Same code with codegen disabled,

Time taken in Spark 1.6: 13.113527926 seconds

Example notebook;