Hive - Table Partitions

Goal: We want to create a partitioned table based on year and month.

Create a hive table based on raw logs from apache web server. You can follow this blog to find out the steps.

http://blog.einext.com/hadoop/hive-table-using-regex-serde

hive> select host, request, size from weblogs limit 10;
+----------------+-------------------------------+--------+--+
|      host      |             time              |  size  |
+----------------+-------------------------------+--------+--+
| 66.249.67.3    | [20/Jul/2009:20:12:22 -0700]  | 5      |
| 66.249.67.3    | [20/Jul/2009:20:12:25 -0700]  | 8068   |
| 64.233.172.17  | [20/Jul/2009:20:12:26 -0700]  | 2969   |
| 74.125.74.193  | [20/Jul/2009:20:13:01 -0700]  | 2969   |
| 192.168.1.198  | [20/Jul/2009:20:13:18 -0700]  | 17935  |
| 192.168.1.198  | [20/Jul/2009:20:13:18 -0700]  | 1504   |
| 192.168.1.198  | [20/Jul/2009:20:13:19 -0700]  | 146    |
| 66.249.67.3    | [20/Jul/2009:20:13:21 -0700]  | 5      |
| 66.249.67.3    | [20/Jul/2009:20:13:24 -0700]  | 8196   |
| 66.249.67.3    | [20/Jul/2009:20:13:29 -0700]  | 6360   |
+----------------+-------------------------------+--------+--+


Create a statement to convert convert the date time stamp to a UTC datetime string.

hive> SELECT t1.*, from_unixtime(UNIX_TIMESTAMP(t1.time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS `ds` FROM weblogs t1;

Use the following command to view table creation statement for the view accesslogs_view, but this time, we want add partition information to it.

hive> show create table weblogs;

Add the partitioning details to the table creation statement like below. Then run the command to create the table in hive.

hive> CREATE TABLE `weblogs_partitioned`(
  `host` string, 
  `identity` string, 
  `user` string, 
  `time` string, 
  `request` string, 
  `status` string, 
  `size` string, 
  `referrer` string, 
  `useragent` string)
PARTITIONED BY(year int, month int, day int)
STORED AS PARQUET;

Enable dynamic partitioning.

hive> SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=10024;
SET hive.exec.max.dynamic.partitions.pernode=10024;
SET hive.exec.dynamic.partition.mode=non-strict;

Run the following statement to load the data into partitioned table.

hive> INSERT INTO TABLE weblogs_partitioned 
PARTITION (year,month, day)
SELECT host
    ,identity
    ,`user`
    ,TIME
    ,request
    ,STATUS
    ,size
    ,referrer
    ,useragent
    ,year(ds)
    ,month(ds)
    ,day(ds) 
FROM (SELECT *,  from_unixtime(UNIX_TIMESTAMP(TIME, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS ds 
       FROM weblogs) t;

Verify the dir structure in the HDFS. Drill down to a few files to check.

$ hadoop fs -ls -R /user/hive/warehouse/weblogs_partitioned
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=10
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=10/day=1
-rwxrwxrwx   1 cloudera supergroup     447901 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=10/day=1/000001_0
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=20
-rwxrwxrwx   1 cloudera supergroup     337404 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=20/000005_0
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=21
-rwxrwxrwx   1 cloudera supergroup    2202305 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=21/000005_0
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=22
-rwxrwxrwx   1 cloudera supergroup    2221714 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=22/000005_0
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=23
-rwxrwxrwx   1 cloudera supergroup    2123367 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=23/000005_0
drwxrwxrwx   - cloudera supergroup          0 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=24
-rwxrwxrwx   1 cloudera supergroup    1631330 2016-08-30 10:20 /user/hive/warehouse/weblogs_partitioned/year=2009/month=7/day=24/000005_0
.... (truncated output)

To see the performance difference due to table partition, compare the execution time of the following two queries

hive> select count(*) from weblogs_partitioned where year = 2009 and month = 10 and day = 1;
hive> select count(*) from (SELECT t1.*, from_unixtime(UNIX_TIMESTAMP(t1.time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS `ds` FROM weblogs t1) t2 where year(t2.ds) = 2009 and month(t2.ds) = 10 and day(t2.ds) = 1;


Compare the number of bytes read from HDFS to run a sample query - find distinct ip addresses for year = 2009 and month = 7

Query on the non_partitioned table

hive> SELECT count(DISTINCT host) FROM weblogs 
where year(from_unixtime(UNIX_TIMESTAMP(time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) = 2009 and 
 month(from_unixtime(UNIX_TIMESTAMP(time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) = 7

HDFS Number of bytes read 5,303,409


Query on the partitioned table

hive> SELECT count(DISTINCT host) FROM weblogs_partitioned where year = 2009 and month = 7

HDFS Number of bytes read 196,398

The number of bytes reads for the same query on the partitioned tabled is much less.

You can find the "Number of bytes read" counter from job history server.