Catalyst and Tungsten
sql("select t1.movieId, avg(t2.rating) from movies2 t1 left join ratings2 t2 on t1.movieid = t2.movieid where t1.movieid = 1 group by t1.movieid ").explain
CSV File
-----------------------------------------------------------------------------
== Physical Plan ==
*HashAggregate(keys=[movieid#66], functions=[avg(cast(rating#88 as double))])
+- Exchange hashpartitioning(movieid#66, 200)
   +- *HashAggregate(keys=[movieid#66], functions=[partial_avg(cast(rating#88 as double))])
      +- *Project [movieId#66, rating#88]
         +- *BroadcastHashJoin [movieid#66], [movieid#87], LeftOuter, BuildRight
            :- *Project [movieId#66]
            :  +- *Filter (isnotnull(movieid#66) && (cast(movieid#66 as int) = 1))
            :     +- *FileScan csv [movieId#66] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/mo..., PartitionFilters: [], PushedFilters: [IsNotNull(movieId)], ReadSchema: struct<movieId:string>
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
               +- *FileScan csv [movieId#87,rating#88] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/ra..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:string,rating:string>
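
The BroadcastHashJoin node (LeftOuter, BuildRight) in the plan above can be pictured roughly as follows: the right side (ratings) is hashed by the join key once and shipped to every task, and each row from the left side (movies) probes that hash table. This is a minimal plain-Scala sketch of the idea, not Spark's actual implementation. The object name, case classes, and sample values are made up for illustration; keys and ratings are strings because the ReadSchema in the plan shows string columns.

```scala
// Illustrative only: plain Scala collections standing in for Spark rows.
object BroadcastJoinSketch {
  final case class Movie(movieId: String)
  final case class Rating(movieId: String, rating: String)

  // Build side (BuildRight): hash the right relation by the join key once.
  // In Spark this hashed relation is what BroadcastExchange sends to each task.
  def buildHashedRelation(ratings: Seq[Rating]): Map[String, Seq[Rating]] =
    ratings.groupBy(_.movieId)

  // Probe side: stream the left rows and look each key up in the hash table.
  // LeftOuter keeps unmatched left rows, with None standing in for SQL NULL.
  def leftOuterJoin(
      movies: Seq[Movie],
      hashed: Map[String, Seq[Rating]]): Seq[(String, Option[String])] =
    movies.flatMap { m =>
      hashed.get(m.movieId) match {
        case Some(matches) => matches.map(r => (m.movieId, Option(r.rating)))
        case None          => Seq((m.movieId, Option.empty[String]))
      }
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, not taken from the MovieLens files.
    val movies  = Seq(Movie("1"), Movie("2"))
    val ratings = Seq(Rating("1", "4.0"), Rating("1", "3.0"))
    leftOuterJoin(movies, buildHashedRelation(ratings)).foreach(println)
  }
}
```

Because the build side is a local hash table on every task, the join itself needs no shuffle; the only Exchange in the plan belongs to the aggregation.
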
Parquet File
-------------------------------------------------------------------------------------------------
== Physical Plan ==
*HashAggregate(keys=[movieid#193], functions=[avg(cast(rating#203 as double))])
+- Exchange hashpartitioning(movieid#193, 200)
   +- *HashAggregate(keys=[movieid#193], functions=[partial_avg(cast(rating#203 as double))])
      +- *Project [movieId#193, rating#203]
         +- *BroadcastHashJoin [movieid#193], [movieid#202], LeftOuter, BuildRight
            :- *Project [movieId#193]
            :  +- *Filter (isnotnull(movieid#193) && (cast(movieid#193 as int) = 1))
            :     +- *FileScan parquet [movieId#193] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/mo..., PartitionFilters: [], PushedFilters: [IsNotNull(movieId)], ReadSchema: struct<movieId:string>
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
               +- *FileScan parquet [movieId#202,rating#203] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/ra..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:string,rating:string>
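
Both plans compute the average in two steps: a HashAggregate with partial_avg below the Exchange, and a final HashAggregate with avg above it. A rough way to picture this is that each partition keeps a (sum, count) buffer per key, and the buffers are merged after rows are repartitioned by movieid. The plain-Scala sketch below illustrates that split under simplifying assumptions; it is not Spark's code. The names `TwoPhaseAvgSketch` and `AvgBuffer` are hypothetical, and NULL ratings and non-numeric strings are not handled.

```scala
// Illustrative only: mimics the partial/final aggregation split.
object TwoPhaseAvgSketch {
  // Aggregation buffer for avg: running sum and row count.
  final case class AvgBuffer(sum: Double, count: Long) {
    def add(x: Double): AvgBuffer = AvgBuffer(sum + x, count + 1)
    def merge(o: AvgBuffer): AvgBuffer = AvgBuffer(sum + o.sum, count + o.count)
    def result: Option[Double] = if (count == 0) None else Some(sum / count)
  }
  val empty: AvgBuffer = AvgBuffer(0.0, 0L)

  // Phase 1 (partial_avg): one partition aggregates its own rows per key.
  def partial(rows: Seq[(String, String)]): Map[String, AvgBuffer] =
    rows.foldLeft(Map.empty[String, AvgBuffer]) { case (acc, (key, rating)) =>
      // Stands in for cast(rating as double): the column was read as a string.
      acc.updated(key, acc.getOrElse(key, empty).add(rating.toDouble))
    }

  // Phase 2 (avg): after the exchange, buffers sharing a key are merged.
  def finalAvg(partials: Seq[Map[String, AvgBuffer]]): Map[String, Option[Double]] =
    partials.flatten.groupBy(_._1).map { case (key, bufs) =>
      key -> bufs.map(_._2).foldLeft(empty)(_ merge _).result
    }

  def main(args: Array[String]): Unit = {
    // Two hypothetical partitions of (movieId, rating) rows.
    val p1 = Seq(("1", "4.0"), ("1", "3.0"))
    val p2 = Seq(("1", "5.0"))
    println(finalAvg(Seq(partial(p1), partial(p2))))
  }
}
```

Only the small per-key buffers cross the Exchange, rather than every joined row, which is the point of aggregating partially before the shuffle.
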