04 Joining Datasets using R

For illustration we will use movielens dataset.

Please download small dataset, ~ 1MB in size, from http://grouplens.org/datasets/movielens/.

Look at readme.html for data description.

Unzip the file and set your R working directory to the directory of unzipped files.

> setwd("/data/movie_lens/ml-latest-small")

We want to display name and average rating for each movie.

Load movies and ratings into 2 data frames.

> movies = read.csv("movies.csv")
> ratings = read.csv("ratings.csv")

View the structure

> str(movies)

'data.frame': 10329 obs. of 3 variables:

 $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ title  : Factor w/ 10327 levels "¡Three Amigos! (1986)",..: 9391 4930 3876 9797 3148 4072 7852 9335 8741 3723 ...
 $ genres : Factor w/ 938 levels "(no genres listed)",..: 338 396 698 651 601 245 698 380 2 123 ...

This dataset has 10329 rows.

> str(ratings)

'data.frame': 105339 obs. of 4 variables:

 $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ movieId  : int  16 24 32 47 50 110 150 161 165 204 ...
 $ rating   : num  4 1.5 4 4 4 4 3 4 3 0.5 ...
 $ timestamp: int  1217897793 1217895807 1217896246 1217896556 1217896523 1217896150 1217895940 1217897864 1217897135 1217895786 ...

This dataset has 105,339 rows.

Find whether is any duplicate movieId in movies dataset.

> movieId.freq = as.data.frame(table(movies$movieId))
> any(movieId.freq$Freq > 1)
[1] FALSE

It shows that there is no duplicate in movieId field.

Intuition is there is going to duplicate for movieId in ratings table as it is a fact table.

Find out average count of ratings for movies.

> ratings.movieId = table(ratings$movieId)
> mean(ratings.movieId) # should be 10.20

Find all movies with "Rocky" in title

> movies[grep("rocky", movies$title, ignore.case = TRUE), ]
     movieId                                          title                                      genres
1521    1954                                   Rocky (1976)                                       Drama
1919    2409                                Rocky II (1979)                                Action|Drama
1920    2410                               Rocky III (1982)                                Action|Drama
1921    2411                                Rocky IV (1985)                                Action|Drama
1922    2412                                 Rocky V (1990)                                Action|Drama
2122    2657          Rocky Horror Picture Show, The (1975)                Comedy|Horror|Musical|Sci-Fi
2968    3754 Adventures of Rocky and Bullwinkle, The (2000) Adventure|Animation|Children|Comedy|Fantasy
7222   49651                            Rocky Balboa (2006)                                Action|Drama
9089   89045                               Rocky VI (1986)                                       Comedy

How many ratings are there for "Rocky IV" that is with movieId = 2411.

> mean(ratings[ratings$movieId == "2411", ]$rating)

Now, let's join the movie.

> ratings.merged = merge(ratings, movies, by = "movieId")
> head(ratings.merged)
  movieId userId rating  timestamp            title                                      genres
1       1    103    3.0  978554606 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2       1    328    5.0 1237950888 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
3       1    453    4.0  859302647 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
4       1    590    3.5 1243424645 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
5       1    419    4.0 1447594754 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
6       1    180    3.5 1180846101 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

To test the joined data, let's find the average rating from joined data. It should match with above.

> mean(ratings.merged$rating[ratings.merged$movieId == "2411"])

More efficient option could have been find average rating from ratings table for each movieid then join.

Find top 6 movies based on highest average rating. Consider those movies that have received at least 100 ratings in total.

> require(plyr)
> movies = read.csv("data/ml-latest-small/movies.csv")
> ratings = read.csv("data/ml-latest-small/ratings.csv")
> df = merge(x = movies, y = ratings, by.x = "movieId", by.y = "movieId")
> df = ddply(df, c("movieId", "title"), function(x) c(avgRating = mean(x$rating), count = nrow(x)))
> df = subset(df, count > 100)
> df = arrange(df, desc(avgRating))
> head(df)
  movieId                                  title avgRating count
1     858                  Godfather, The (1972)  4.487500   200
2     318       Shawshank Redemption, The (1994)  4.487138   311
3    1221         Godfather: Part II, The (1974)  4.385185   135
4      50             Usual Suspects, The (1995)  4.370647   201
5     527                Schindler's List (1993)  4.303279   244
6    1193 One Flew Over the Cuckoo's Nest (1975)  4.256944   144