Catalyst and Tungsten

sql("select t1.movieId, avg(t2.rating) from movies2 t1 left join ratings2 t2 on t1.movieid = t2.movieid where t1.movieid = 1 group by t1.movieid ").explain

CSV File

-----------------------------------------------------------------------------

== Physical Plan ==

*HashAggregate(keys=[movieid#66], functions=[avg(cast(rating#88 as double))])

+- Exchange hashpartitioning(movieid#66, 200)

+- *HashAggregate(keys=[movieid#66], functions=[partial_avg(cast(rating#88 as double))])

+- *Project [movieId#66, rating#88]

+- *BroadcastHashJoin [movieid#66], [movieid#87], LeftOuter, BuildRight

:- *Project [movieId#66]

: +- *Filter (isnotnull(movieid#66) && (cast(movieid#66 as int) = 1))

: +- *FileScan csv [movieId#66] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/mo..., PartitionFilters: [], PushedFilters: [IsNotNull(movieId)], ReadSchema: struct<movieId:string>

+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))

+- *FileScan csv [movieId#87,rating#88] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/ra..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:string,rating:string>

Parquet File

-------------------------------------------------------------------------------------------------

== Physical Plan ==

*HashAggregate(keys=[movieid#193], functions=[avg(cast(rating#203 as double))])

+- Exchange hashpartitioning(movieid#193, 200)

+- *HashAggregate(keys=[movieid#193], functions=[partial_avg(cast(rating#203 as double))])

+- *Project [movieId#193, rating#203]

+- *BroadcastHashJoin [movieid#193], [movieid#202], LeftOuter, BuildRight

:- *Project [movieId#193]

: +- *Filter (isnotnull(movieid#193) && (cast(movieid#193 as int) = 1))

: +- *FileScan parquet [movieId#193] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/mo..., PartitionFilters: [], PushedFilters: [IsNotNull(movieId)], ReadSchema: struct<movieId:string>

+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))

+- *FileScan parquet [movieId#202,rating#203] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/abulbasar/workspace/python/machine-learning/data/ml-latest-small/ra..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:string,rating:string>