Codegen in Spark 2.0
Launch spark-shell for Spark v2.2.
$ bin/spark-shell --conf spark.sql.codegen.comments=true
import org.apache.spark.sql.execution.debug._
spark.conf.set("spark.sql.codegen.wholeStage","true")
val loadCsv = spark.read.option("header","true").option("inferSchema","true").csv(_:String)
val basePath = "/data/ml-latest"
val movies = loadCsv(s"$basePath/movies.csv")
val ratings = loadCsv(s"$basePath/ratings.csv")
val moviesAgg = ratings.alias("t1")
.join(movies.alias("t2"), col("t1.movieId") === col("t2.movieId"))
.groupBy("t1.movieId","title")
.agg(avg("rating").alias("rating_avg"), count("*").alias("rating_count"))
.filter("rating_count >=100")
.orderBy(desc("rating_avg"))
.limit(10)
Explain plan:
moviesAgg.explain(true)
In the explain output, when an operator has a star around it (*), whole-stage code generation is enabled.
View the code gen output.
moviesAgg.debugCodegen
Effect of codegen
Because of the codgen, the dataset DSL or Spark SQL runs a lot faster.
spark.conf.set("spark.sql.codegen.wholeStage", true)
benchmark("Spark 2.0") {
spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
}
Time taken in Spark 2.0: 0.584037925 seconds
Same code with codegen disabled,
Time taken in Spark 1.6: 13.113527926 seconds
Example notebook;
https://www.zepl.com/spaces/S_ZEPL/e6c838b40e2e406db5d19eeddeeffe36