Spark Dataframe with Python (Pyspark)

You can use pyspark in terminal or Jupyter notebook. Following these instruction to start Jupyter notebook with Pyspark support.

Load Spark SQL functions, for example: count, avg, explode etc.

from pyspark.sql.functions import *

Location of movies dataset. You can download the dataset from here. Here we are using latest-small dataset.

home_dir = "/user/cloudera/movie-lens"

Create a dataframe on movies.csv file

movies = spark.read.format("csv")\
.options(header = True, inferSchema = True)\
.load(home_dir + "/movies")\
.persist()

Show schema of movies dataframe

movies.printSchema()
root  |-- movieId: integer (nullable = true)  |-- title: string (nullable = true)  |-- genres: string (nullable = true)

Display a few sample view from movies Dataframe.

movies.show(5)
+-------+--------------------+--------------------+ |movieId|               title|              genres| +-------+--------------------+--------------------+ |      1|    Toy Story (1995)|Adventure|Animati...| |      2|      Jumanji (1995)|Adventure|Childre...| |      3|Grumpier Old Men ...|      Comedy|Romance| |      4|Waiting to Exhale...|Comedy|Drama|Romance| |      5|Father of the Bri...|              Comedy| +-------+--------------------+--------------------+ only showing top 5 rows

Create ratings Dataframe using ratings.csv file

ratings = spark.read.format("csv")\ .options(header = True, inferSchema = True)\ .load(home_dir + "/ratings")\ .persist() 

Print schema of ratings Dataframe

ratings.printSchema()
root  |-- userId: integer (nullable = true)  |-- movieId: integer (nullable = true)  |-- rating: double (nullable = true)  |-- timestamp: integer (nullable = true)

Show Schema of ratings Dataframe.

ratings.show(5)
+------+-------+------+----------+ |userId|movieId|rating| timestamp| +------+-------+------+----------+ |     1|     16|   4.0|1217897793| |     1|     24|   1.5|1217895807| |     1|     32|   4.0|1217896246| |     1|     47|   4.0|1217896556| |     1|     50|   4.0|1217896523| +------+-------+------+----------+ only showing top 5 rows

Find average rating of each movie for which there are at least 100 ratings. Order the result by average rating in decreasing order.

ratings_agg = ratings\
.groupBy(col("movieId"))\
.agg(
    count(col("movieId")).alias("count"),
    avg(col("rating")).alias("avg_rating")
).alias("t1")
ratings_agg\
.join(movies.alias("t2"), col("t1.movieId") == col("t2.movieId"))\
.filter("count > 100")\
.orderBy(desc("avg_rating"))\
.show()
+-------+-----+------------------+-------+--------------------+--------------------+ |movieId|count|        avg_rating|movieId|               title|              genres| +-------+-----+------------------+-------+--------------------+--------------------+ |    318|  308| 4.454545454545454|    318|Shawshank Redempt...|         Crime|Drama| |    858|  210| 4.392857142857143|    858|Godfather, The (1...|         Crime|Drama| |     50|  228| 4.328947368421052|     50|Usual Suspects, T...|Crime|Mystery|Thr...| |   1136|  154|4.3019480519480515|   1136|Monty Python and ...|Adventure|Comedy|...| |    527|  248| 4.296370967741935|    527|Schindler's List ...|           Drama|War| |   1193|  143|4.2727272727272725|   1193|One Flew Over the...|               Drama| |    608|  201|4.2711442786069655|    608|        Fargo (1996)|Comedy|Crime|Dram...| |   2571|  261| 4.264367816091954|   2571|  Matrix, The (1999)|Action|Sci-Fi|Thr...| |   1221|  140| 4.260714285714286|   1221|Godfather: Part I...|         Crime|Drama| |   1213|  135|4.2592592592592595|   1213|   Goodfellas (1990)|         Crime|Drama| |    912|  125|             4.236|    912|   Casablanca (1942)|       Drama|Romance| |   1196|  228| 4.228070175438597|   1196|Star Wars: Episod...|Action|Adventure|...| |   1198|  224| 4.212053571428571|   1198|Raiders of the Lo...|    Action|Adventure| |    593|  290| 4.194827586206896|    593|Silence of the La...|Crime|Horror|Thri...| |  79132|  103| 4.189320388349515|  79132|    Inception (2010)|Action|Crime|Dram...| |    260|  273| 4.188644688644689|    260|Star Wars: Episod...|Action|Adventure|...| |   2959|  207| 4.188405797101449|   2959|   Fight Club (1999)|Action|Crime|Dram...| |    541|  156| 4.169871794871795|    541| Blade Runner (1982)|Action|Sci-Fi|Thr...| |   1704|  140| 4.167857142857143|   1704|Good Will Hunting...|       Drama|Romance| |   1197|  171|  4.16374269005848|   1197|Princess Bride, T...|Action|Adventure|...| +-------+-----+------------------+-------+--------------------+--------------------+ only showing top 20 rows

Find average rating of each genre.

genre_avg_rating = ratings.alias("t1")\ .join(movies.alias("t2"), col("t1.movieId") == col("t2.movieId"))\ .select(col("rating"), explode(split("genres", r"\|")).alias("genre"))\ .groupBy(col("genre"))\ .agg(count(col("genre")).alias("count"), avg("rating").alias("avg_rating")) \ .orderBy(desc("avg_rating"))  genre_avg_rating.show()
+------------------+-----+------------------+ |             genre|count|        avg_rating| +------------------+-----+------------------+ |         Film-Noir| 1210|3.9136363636363636| |               War| 5828|3.7832017844886754| |           Mystery| 8320| 3.652043269230769| |             Drama|46960|3.6502661839863713| |       Documentary| 1206|3.6430348258706466| |             Crime|18291|3.6423924334372098| |              IMAX| 3032| 3.641820580474934| |         Animation| 5966|3.6353503184713376| |           Musical| 4287|  3.57196174480989| |           Western| 2314| 3.565687121866897| |           Romance|19094| 3.544254739708809| |         Adventure|23076| 3.518027387762177| |           Fantasy|10889|3.5004591789879695| |          Thriller|29288|3.4955613220431574| |            Sci-Fi|16795|3.4544805001488537| |            Action|31205|3.4514500881269026| |          Children| 8098|3.4394294887626575| |            Comedy|38055|3.4209959269478385| |            Horror| 7983| 3.281097331830139| |(no genres listed)|    7|3.0714285714285716| +------------------+-----+------------------+

Using matplotlib show barplot of average rating for each genre.

Loading matplotlib library.

import pandas as pd import matplotlib.pyplot as plt %matplotlib inline #Required to display graph in jupyter notebook

Convert spark dataframe to Pandas Dataframe.

df = genre_avg_rating.toPandas()

Plot average rating for each genre.

df.plot("genre", "avg_rating", "bar", title = "Barplot of avg rating by genre")

Notebook: there are more notebook in the github repo.

https://github.com/abulbasar/pyspark-examples