Hive - Sampling

Sampling

When you are dealing with large tables, it is more time efficient to run the analysis on the sample data first. Hive support three types of sampling - random sampling, bucket table sampling and block sampling.

Random sampling

hive> select tr_date, symbol, adjclose from stocks order by rand() limit 3;
+-------------+---------+------------+--+
|   tr_date   | symbol  |  adjclose  |
+-------------+---------+------------+--+
| 2007-08-03  | PRGO    | 18.657708  |
| 2013-01-09  | CLX     | 67.151987  |
| 2005-02-09  | MRO     | 9.270478   |
+-------------+---------+------------+--+


A more performant solution is. Distribute and sort randomizes the data at mapper and reducer end.

hive> select tr_date, symbol, adjclose from stocks distribute by rand() sort by rand() limit 3;
+-------------+---------+-------------+--+
|   tr_date   | symbol  |  adjclose   |
+-------------+---------+-------------+--+
| 2015-10-05  | GD      | 141.032627  |
| 2015-10-19  | FLIR    | 27.324943   |
| 2001-04-16  | FLIR    | 1.254247    |
+-------------+---------+-------------+--+


Bucket table sampling

This is a special sampling method, optimized for bucket tables


hive> select tr_date, symbol, adjclose from stocks_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON symbol)
 where year = 2016 limit 3;
+-------------+---------+-------------+--+
|   tr_date   | symbol  |  adjclose   |
+-------------+---------+-------------+--+
| 2016-04-11  | NFX     | 34.450001   |
| 2016-03-23  | ADS     | 209.800003  |
| 2016-07-15  | HUM     | 158.889999  |
+-------------+---------+-------------+--+


Block sampling

This type of sampling allows a query to randomly pick up n rows of data, n percentage of the data size, or n bytes of data. The sampling granularity is the HDFS block size.

hive> select tr_date, symbol, adjclose from stocks tablesample(3 rows);
+-------------+---------+------------+--+
|   tr_date   | symbol  |  adjclose  |
+-------------+---------+------------+--+
| 2000-07-17  | XLNX    | 74.269199  |
| 2000-07-17  | ES      | 13.48614   |
| 2000-07-17  | CHK     | 5.241649   |
+-------------+---------+------------+--+