04 Joining Datasets using R
For illustration we will use movielens dataset.
Please download small dataset, ~ 1MB in size, from http://grouplens.org/datasets/movielens/.
Look at readme.html for data description.
Unzip the file and set your R working directory to the directory of unzipped files.
> setwd("/data/movie_lens/ml-latest-small")
We want to display name and average rating for each movie.
Load movies and ratings into 2 data frames.
> movies = read.csv("movies.csv")
> ratings = read.csv("ratings.csv")
View the structure
> str(movies)
'data.frame': 10329 obs. of 3 variables:
$ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
$ title : Factor w/ 10327 levels "¡Three Amigos! (1986)",..: 9391 4930 3876 9797 3148 4072 7852 9335 8741 3723 ...
$ genres : Factor w/ 938 levels "(no genres listed)",..: 338 396 698 651 601 245 698 380 2 123 ...
This dataset has 10329 rows.
> str(ratings)
'data.frame': 105339 obs. of 4 variables:
$ userId : int 1 1 1 1 1 1 1 1 1 1 ...
$ movieId : int 16 24 32 47 50 110 150 161 165 204 ...
$ rating : num 4 1.5 4 4 4 4 3 4 3 0.5 ...
$ timestamp: int 1217897793 1217895807 1217896246 1217896556 1217896523 1217896150 1217895940 1217897864 1217897135 1217895786 ...
This dataset has 105,339 rows.
Find whether is any duplicate movieId in movies dataset.
> movieId.freq = as.data.frame(table(movies$movieId))
> any(movieId.freq$Freq > 1)
[1] FALSE
It shows that there is no duplicate in movieId field.
Intuition is there is going to duplicate for movieId in ratings table as it is a fact table.
Find out average count of ratings for movies.
> ratings.movieId = table(ratings$movieId)
> mean(ratings.movieId) # should be 10.20
Find all movies with "Rocky" in title
> movies[grep("rocky", movies$title, ignore.case = TRUE), ]
movieId title genres
1521 1954 Rocky (1976) Drama
1919 2409 Rocky II (1979) Action|Drama
1920 2410 Rocky III (1982) Action|Drama
1921 2411 Rocky IV (1985) Action|Drama
1922 2412 Rocky V (1990) Action|Drama
2122 2657 Rocky Horror Picture Show, The (1975) Comedy|Horror|Musical|Sci-Fi
2968 3754 Adventures of Rocky and Bullwinkle, The (2000) Adventure|Animation|Children|Comedy|Fantasy
7222 49651 Rocky Balboa (2006) Action|Drama
9089 89045 Rocky VI (1986) Comedy
How many ratings are there for "Rocky IV" that is with movieId = 2411.
> mean(ratings[ratings$movieId == "2411", ]$rating)
Now, let's join the movie.
> ratings.merged = merge(ratings, movies, by = "movieId")
> head(ratings.merged)
movieId userId rating timestamp title genres
1 1 103 3.0 978554606 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 1 328 5.0 1237950888 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
3 1 453 4.0 859302647 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
4 1 590 3.5 1243424645 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
5 1 419 4.0 1447594754 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
6 1 180 3.5 1180846101 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
To test the joined data, let's find the average rating from joined data. It should match with above.
> mean(ratings.merged$rating[ratings.merged$movieId == "2411"])
More efficient option could have been find average rating from ratings table for each movieid then join.
Find top 6 movies based on highest average rating. Consider those movies that have received at least 100 ratings in total.
> require(plyr)
> movies = read.csv("data/ml-latest-small/movies.csv")
> ratings = read.csv("data/ml-latest-small/ratings.csv")
> df = merge(x = movies, y = ratings, by.x = "movieId", by.y = "movieId")
> df = ddply(df, c("movieId", "title"), function(x) c(avgRating = mean(x$rating), count = nrow(x)))
> df = subset(df, count > 100)
> df = arrange(df, desc(avgRating))
> head(df)
movieId title avgRating count
1 858 Godfather, The (1972) 4.487500 200
2 318 Shawshank Redemption, The (1994) 4.487138 311
3 1221 Godfather: Part II, The (1974) 4.385185 135
4 50 Usual Suspects, The (1995) 4.370647 201
5 527 Schindler's List (1993) 4.303279 244
6 1193 One Flew Over the Cuckoo's Nest (1975) 4.256944 144