Simple Dataframe Operations

Load movies.csv as a data frame

val movies = spark

.read

.format("csv") // Indicates the format is of csv type

.option("header", "true")

.load("/user/cloudera/movielens/movies")

View a few sample values

movies.show()

Print the schema of movies dataframe

movies.printSchema

Load ratings.csv as dataframe

val ratings = spark

.read

.format("csv")

.options(Map(

"header" -> "true", //indicates the file contains header

"inferSchema" -> "true", //infer schema for field type. If not specified all fields will be of string type

"sep" -> "," // Specity seperator. Default is command(,).

))

.load("/user/cloudera/movielens/ratings")

Print schema of ratings dataframe

ratings.printSchema

View a few sample values

ratings.show

Get access to underlying RDD. Notice, the RDD contains object of type Row class.

ratings.rdd

movies.registerTempTable("movies")

ratings.registerTempTable("ratings")

View the available table. The temporary tables shows up and "Is Temporary" attribute of the table should true.

sql("show tables").show

Now, you are write SQL query on the registered temporary table. Here, we are trying to fing our the avarage rating of the movies by movieId and title.

val ratingsAvg = sql("""select t1.movieid, t1.title, avg(t2.rating) avg_rating from movies t1

join ratings t2 on t1.movieid = t2.movieid group by t1.movieId, t1.title""")

View some values from the dataframe generated out of the query.

ratingsAvg.show

Let's save the dataframe to a persistent file system.

ratingsAvg

.coalesce(1) // reducing partition ensures the num of output file will be 1

.write

.format("parquet") // format of the output file. You can specify csv, json etc.

.mode("overwrite") // If you have exiting file in the target dir, spark will overwrite

.save("/user/cloudera/ratingsAvg") //target dir where the files will be save

sql("select * from parquet.`/user/cloudera/ratingsAvg`").show(5)