Codegen in Spark 2.0

Launch spark-shell for Spark v2.2.

$ bin/spark-shell --conf spark.sql.codegen.comments=true

import org.apache.spark.sql.execution.debug._


val loadCsv = spark.read.option("header", "true").option("inferSchema", "true").csv(_: String)

val basePath = "/data/ml-latest"

val movies = loadCsv(s"$basePath/movies.csv")

val ratings = loadCsv(s"$basePath/ratings.csv")

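// The groupBy key below is an assumption; the original grouping line is missing from this text.
val moviesAgg = ratings.alias("t1")
  .join(movies.alias("t2"), col("t1.movieId") === col("t2.movieId"))
  .groupBy(col("t1.movieId"))
  .agg(avg("rating").alias("rating_avg"), count("*").alias("rating_count"))
  .filter("rating_count >= 100")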



Explain plan: print the physical plan with moviesAgg.explain().

In the explain output, an operator prefixed with a star (*) is executed as part of a whole-stage code generation stage.

View the codegen output with moviesAgg.debugCodegen(), which the debug import above makes available.
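Both steps can be tried without the CSV files. The following is only a sketch, assuming a spark-shell session where `spark` is predefined; the `demo` DataFrame and its generated data are hypothetical stand-ins, not the ratings data used above.

```scala
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions._

// Hypothetical stand-in data: 1000 rows spread evenly over 10 movie ids.
val demo = spark.range(0, 1000)
  .withColumn("movieId", col("id") % 10)
  .groupBy("movieId")
  .agg(count("*").alias("rating_count"))
  .filter("rating_count >= 100")

// Prints the physical plan; starred operators belong to a codegen stage.
demo.explain()

// Prints the generated Java source for each whole-stage codegen subtree.
demo.debugCodegen()
```

With spark.sql.codegen.comments=true set at launch, the generated source should also carry comments naming the expressions it was produced from.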


Effect of codegen

With whole-stage codegen, queries written with the Dataset DSL or Spark SQL run much faster.
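The benchmark helper called in the snippet that follows is not defined in this text. A minimal sketch, assuming it simply runs a block once and prints the elapsed wall-clock time in the same format as the timings shown here (the name, signature, and return value are assumptions for illustration):

```scala
// Hypothetical helper: times a single run of `f` using System.nanoTime.
def benchmark(name: String)(f: => Unit): Double = {
  val start = System.nanoTime()
  f // run the block being measured
  val seconds = (System.nanoTime() - start) / 1e9
  println(s"Time taken in $name: $seconds seconds")
  seconds
}
```

A single run like this includes JVM warm-up and any one-off setup cost, so the numbers are only a rough comparison.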

spark.conf.set("spark.sql.codegen.wholeStage", true)

benchmark("Spark 2.0") {
  spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
}

Time taken in Spark 2.0: 0.584037925 seconds

The same code with whole-stage codegen disabled (spark.sql.codegen.wholeStage set to false), labeled "Spark 1.6" in the benchmark:

Time taken in Spark 1.6: 13.113527926 seconds
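The disabled run might look like the sketch below. It assumes a spark-shell session where `spark` is predefined, and it inlines the timing rather than relying on a helper; the "Spark 1.6" label only mirrors the output above and does not mean a different Spark version is running.

```scala
// Turn whole-stage codegen off for this session.
spark.conf.set("spark.sql.codegen.wholeStage", false)

val start = System.nanoTime()
spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
val seconds = (System.nanoTime() - start) / 1e9
println(s"Time taken in Spark 1.6: $seconds seconds")

// Restore the default so later queries use whole-stage codegen again.
spark.conf.set("spark.sql.codegen.wholeStage", true)
```

Timings depend on hardware and configuration, so the absolute numbers will differ from those quoted here.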

Example notebook: