Hive - Sampling
Sampling
When you are dealing with large tables, it is more time efficient to run the analysis on the sample data first. Hive support three types of sampling - random sampling, bucket table sampling and block sampling.
Random sampling
hive> select tr_date, symbol, adjclose from stocks order by rand() limit 3;
+-------------+---------+------------+--+
| tr_date | symbol | adjclose |
+-------------+---------+------------+--+
| 2007-08-03 | PRGO | 18.657708 |
| 2013-01-09 | CLX | 67.151987 |
| 2005-02-09 | MRO | 9.270478 |
+-------------+---------+------------+--+
A more performant solution is. Distribute and sort randomizes the data at mapper and reducer end.
hive> select tr_date, symbol, adjclose from stocks distribute by rand() sort by rand() limit 3;
+-------------+---------+-------------+--+
| tr_date | symbol | adjclose |
+-------------+---------+-------------+--+
| 2015-10-05 | GD | 141.032627 |
| 2015-10-19 | FLIR | 27.324943 |
| 2001-04-16 | FLIR | 1.254247 |
+-------------+---------+-------------+--+
Bucket table sampling
This is a special sampling method, optimized for bucket tables
hive> select tr_date, symbol, adjclose from stocks_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON symbol)
where year = 2016 limit 3;
+-------------+---------+-------------+--+
| tr_date | symbol | adjclose |
+-------------+---------+-------------+--+
| 2016-04-11 | NFX | 34.450001 |
| 2016-03-23 | ADS | 209.800003 |
| 2016-07-15 | HUM | 158.889999 |
+-------------+---------+-------------+--+
Block sampling
This type of sampling allows a query to randomly pick up n rows of data, n percentage of the data size, or n bytes of data. The sampling granularity is the HDFS block size.
hive> select tr_date, symbol, adjclose from stocks tablesample(3 rows);
+-------------+---------+------------+--+
| tr_date | symbol | adjclose |
+-------------+---------+------------+--+
| 2000-07-17 | XLNX | 74.269199 |
| 2000-07-17 | ES | 13.48614 |
| 2000-07-17 | CHK | 5.241649 |
+-------------+---------+------------+--+