Hive Bucketing Example
Prerequisite: create stocks table as describe in the first portion of this article.
Create table by with partition by year and bucket on symbol column.
hive> CREATE EXTERNAL TABLE `stocks_bucket`(
`tr_date` string,
`open` double,
`high` double,
`low` double,
`close` double,
`volume` double,
`adjclose` double,
`symbol` string)
PARTITIONED BY(year int)
CLUSTERED BY (`symbol`) INTO 4 BUCKETS;
Before loading data into bucketed table, set the right number of reducers or set the hive configuration that will determine the number reducers by its own.
hive> set map.reduce.tasks = 4; -- This number should be equal to the number of buckets
set hive.exec.dynamic.partition=true; -- Required if you want to create dynamic partitioning using insert into statement
set hive.enforce.bucketing = true; -- This is recommended
Populate the stocks_bucket table with the data from stocks table
INSERT INTO `stocks_bucket`
PARTITION (year)
SELECT
`tr_date`,
`open`,
`high`,
`low`,
`close`,
`volume`,
`adjclose`,
`symbol`,
year(tr_date) FROM stocks
View the directory of stocks_bucket table.
$ hadoop fs -ls /user/hive/warehouse/stocks_bucket
Found 19 items
drwxrwxrwt - cloudera cloudera 1 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/.hive-staging_hive_2018-08-28_03-27-52_161_558394225296515089-1
drwx------ - cloudera cloudera 1 2018-08-28 03:27 /user/hive/warehouse/stocks_bucket/_scratchdir_hive_2018-08-28_03-27-52_161_558394225296515089-14
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2000
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2001
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2002
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2003
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2004
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2005
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2006
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2007
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2008
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2009
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2010
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2011
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2012
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2013
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2014
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2015
drwxrwxrwt - cloudera cloudera 4 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2016
View the directory for a given partition. Since we created 4 buckets based on the stocks symbol, stocks symbols will be exclusively distributed among 4 files.
$ hadoop fs -ls /user/hive/warehouse/stocks_bucket/year=2000
Found 4 items
-rwxrwxrwt 1 cloudera cloudera 812167 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2000/000000_0
-rwxrwxrwt 1 cloudera cloudera 652720 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2000/000001_0
-rwxrwxrwt 1 cloudera cloudera 851500 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2000/000002_0
-rwxrwxrwt 1 cloudera cloudera 735537 2018-08-28 03:28 /user/hive/warehouse/stocks_bucket/year=2000/000003_0
$ hadoop fs -cat /user/hive/warehouse/stocks_bucket/year=2000/000000_0 | head
2000-08-2213.37513.87513.37513.5222700.013.006862PXD
2000-08-2225.75001926.62525.43748126.6252544000.03.278705CSX
2000-11-2941.06248141.06248140.2540.9375191.23904E78.546621LOW
2000-08-2221.25000121.31250420.25000620.6874946013900.09.815813USB
2000-11-2966.812567.12563.2565.06259.82804E722.376616MSFT
2000-11-2966.12496267.43752366.12496267.1249627696800.07.741655AET
2000-11-2950.551.049.87550.0795400.019.656488TXT
2000-11-29134.5139.375133.25138.25775200.034.5625LH
2000-11-2914.37514.514.2514.4375963800.013.910117PXD
2000-11-2926.31251926.87501926.31251926.5624813726600.03.347279CSX
Since we have not specified our own field delimiter, hive used its default.
What is right number of bucket?
Each bucket should have approximately 512 MB of data. Try to choose the number of bucket ax 2 ^ n that 4, 8, 16 etc.
Benefits of bucketing
By using buckets, HQL query can easily and efficiently do data aggregation, sampling), bucket-side joins, and map-side joins (also known as hash join) data correlation and scope.
hive> select tr_date, symbol, adjclose from stocks_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON symbol)
where year = 2016 limit 3;
+-------------+---------+-------------+--+
| tr_date | symbol | adjclose |
+-------------+---------+-------------+--+
| 2016-04-11 | NFX | 34.450001 |
| 2016-03-23 | ADS | 209.800003 |
| 2016-07-15 | HUM | 158.889999 |
+-------------+---------+-------------+--+
Which column you should bucket by?
Choose the column that is often used for joining with other tables, for example customer id, employee id, product id etc.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables