Hive Join Example
Goal: demonstrate a join operation in Hive on CSV files.
Download the MovieLens dataset from http://grouplens.org/datasets/movielens/.
Unzip the files and put each file in its own directory; Hive expects a directory, not a file, as the table location.
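For example, the setup might look like the following, assuming the ml-latest-small archive with movies.csv and ratings.csv and the /user/cloudera/movielens HDFS path used below:
$ hadoop fs -mkdir -p /user/cloudera/movielens/movies /user/cloudera/movielens/ratings
$ hadoop fs -put movies.csv /user/cloudera/movielens/movies/
$ hadoop fs -put ratings.csv /user/cloudera/movielens/ratings/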
Create a table on the movies data using a delimited text format.
hive> create external table movies (
movieId int,
title string,
genres string)
row format delimited
fields terminated by ','
lines terminated by '\n'
location '/user/cloudera/movielens/movies'
tblproperties ("skip.header.line.count"="1");
Do you see the problem with the bad data, for example with movieId = 11? When a column value contains a comma, the parsing breaks. How do we fix this? A Hive SerDe comes to the rescue.
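For example, a quick look at that row against the delimited table above (the exact values depend on your copy of the data) shows the quoted title being split on its embedded comma:
hive> select movieId, title, genres from movies where movieId = 11;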
Create the table using a Hive SerDe, dropping it first if it already exists.
hive> DROP TABLE IF EXISTS movies;
hive> CREATE EXTERNAL TABLE movies(
movieId string,
title string,
genres string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"escapeChar" = "\\"
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/cloudera/movielens/movies'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1')
;
Note: OpenCSVSerde reads every column as STRING. Even if you declare a column as int or another type, the values come back as STRING; use the cast() function to convert them to other types.
More information about the CSV SerDe: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
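A small sketch of casting the string columns back to numeric types, using the column names defined above:
hive> select cast(movieId as int) movieId, title from movies limit 5;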
In the same way, create a table on the ratings data.
hive> CREATE EXTERNAL TABLE ratings(
userId string,
movieId string,
rating double,
`timestamp` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"escapeChar" = "\\"
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/cloudera/movielens/ratings'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1');
Note: timestamp is a reserved word in some Hive distributions, so it is enclosed in backticks.
Now join the two tables.
hive> SET hive.auto.convert.join=false;
hive> SELECT movies.movieid, title, avg(rating) from movies left join ratings on movies.movieid = ratings.movieid group by movies.movieid, title;
Setting hive.auto.convert.join to false disables the auto map join feature. Learn more at https://cwiki.apache.org/confluence/display/hive/languagemanual+joinoptimization
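You can check the current value of the setting, and re-enable map joins after the query, with:
hive> SET hive.auto.convert.join;
hive> SET hive.auto.convert.join=true;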
Find on which day of the week most ratings are posted.
hive> select from_unixtime(cast(`timestamp` as bigint), 'E') weekday from ratings limit 10;
hive> select from_unixtime(cast(`timestamp` as bigint), 'E') weekday, count(*) from ratings group by from_unixtime(cast(`timestamp` as bigint), 'E');
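To get the busiest day directly, the same aggregation can be ordered by the count (a minor variation on the query above):
hive> select from_unixtime(cast(`timestamp` as bigint), 'E') weekday, count(*) cnt from ratings group by from_unixtime(cast(`timestamp` as bigint), 'E') order by cnt desc;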
See the Java SimpleDateFormat documentation for the available date format patterns.
Movie Recommendation Based on Genres
Create an exploded view of movie id and genre.
hive> create view movie_by_genre as select movieid, genre from (select movieid, split(genres, '\\|') genres from movies) t lateral view explode(genres) t as genre;
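A quick sanity check of the exploded view; each movie should now appear once per genre:
hive> select * from movie_by_genre limit 10;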
Find, for each user, the rank of each genre. Ideally you would compute a weighted score; here we simply keep only ratings of 4 and above.
hive> create temporary table movie_by_user_genre as
select t1.*, t2.rating,t2.userid from movie_by_genre t1
left join ratings t2 on t1.movieid = t2.movieid where t2.rating >= 4;
Create a temporary table with each user's total rating by genre.
hive> create temporary table user_by_genre_totalrating as select userid, genre, sum(rating) total_rating from movie_by_user_genre group by userid, genre;
Find the top 3 genres for each user and create a temporary table for them.
hive> create temporary table user_top3_genres as select * from
(select userid, genre, row_number() over (partition by userid order by total_rating desc) row_num from user_by_genre_totalrating) t where t.row_num <= 3;
Create a temporary table that contains the average rating for each movie and genre combination.
hive> create temporary table genre_movie_avgrating as
select t1.movieid, t1.genre, avg(t2.rating) as avg_rating from movie_by_genre t1 left join ratings t2 on t1.movieid = t2.movieid group by t1.movieid, t1.genre;
Create a temporary table with the top 10 movies for each genre.
hive> create temporary table top10movies_by_genre as
select * from (select genre, movieid, avg_rating, row_number() over (partition by genre order by avg_rating desc) as row_num from genre_movie_avgrating) t1 where t1.row_num <= 10;
Find the top 10 movie recommendations for each user.
hive> create temporary table user_genre_avgrating as
select t1.userid, t1.genre, t2.movieid, t2.avg_rating from user_top3_genres t1 left join top10movies_by_genre t2 on t1.genre = t2.genre;
hive> create temporary table user_movie_reco as
select t1.*, t2.title, t2.genres from (select userid, movieid, genre, avg_rating, row_number() over (partition by userid order by avg_rating desc) row_num from user_genre_avgrating) t1 join movies t2 on t1.movieid = t2.movieid where t1.row_num <= 10;
View the recommendations.
hive> select userid, title, genre, avg_rating, genres from user_movie_reco;
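To inspect the recommendations for a single user (the userid value '1' is just an example; OpenCSVSerde stores ids as strings):
hive> select title, genre, avg_rating from user_movie_reco where userid = '1';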
CTAS (CREATE TABLE AS SELECT) has the following restrictions (a minimal example follows the list):
-- cannot be a partitioned table
-- cannot be an external table
-- cannot be a list-bucketing table
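For reference, a minimal CTAS sketch (the table name movies_action_copy is just an example):
hive> create table movies_action_copy as select * from movies where genres like '%Action%';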
Save the output of a query to an HDFS directory.
hive> INSERT OVERWRITE DIRECTORY '/user/cloudera/movies_action'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
NULL DEFINED AS ''
SELECT * FROM movies WHERE genres like '%Action%';
View the files created in HDFS
$ hadoop fs -ls /user/cloudera/movies_action
Found 3 items
drwxr-xr-x - cloudera cloudera 1 2018-08-28 03:11 /user/cloudera/movies_action/.hive-staging_hive_2018-08-28_03-11-47_260_3940892981476728227-1
-rwxr-xr-x 1 cloudera cloudera 76897 2018-08-28 03:11 /user/cloudera/movies_action/000000_0
drwx------ - cloudera cloudera 1 2018-08-28 03:11 /user/cloudera/movies_action/_scratchdir_hive_2018-08-28_03-11-47_260_3940892981476728227-13
View the content of the output
$ hadoop fs -cat /user/cloudera/movies_action/0* | head -n 5
6,Heat (1995),Action|Crime|Thriller
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
15,Cutthroat Island (1995),Action|Adventure|Romance
20,Money Train (1995),Action|Comedy|Crime|Drama|Thriller