Hive - Sampling

Sampling

When you are dealing with large tables, it is more time efficient to run the analysis on the sample data first. Hive support three types of sampling - random sampling, bucket table sampling and block sampling.

Random sampling

hive> select tr_date, symbol, adjclose from stocks order by rand() limit 3;

+-------------+---------+------------+--+

| tr_date | symbol | adjclose |

+-------------+---------+------------+--+

| 2007-08-03 | PRGO | 18.657708 |

| 2013-01-09 | CLX | 67.151987 |

| 2005-02-09 | MRO | 9.270478 |

+-------------+---------+------------+--+

A more performant solution is. Distribute and sort randomizes the data at mapper and reducer end.

hive> select tr_date, symbol, adjclose from stocks distribute by rand() sort by rand() limit 3;

+-------------+---------+-------------+--+

| tr_date | symbol | adjclose |

+-------------+---------+-------------+--+

| 2015-10-05 | GD | 141.032627 |

| 2015-10-19 | FLIR | 27.324943 |

| 2001-04-16 | FLIR | 1.254247 |

+-------------+---------+-------------+--+

Bucket table sampling

This is a special sampling method, optimized for bucket tables

hive> select tr_date, symbol, adjclose from stocks_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON symbol)

where year = 2016 limit 3;

+-------------+---------+-------------+--+

| tr_date | symbol | adjclose |

+-------------+---------+-------------+--+

| 2016-04-11 | NFX | 34.450001 |

| 2016-03-23 | ADS | 209.800003 |

| 2016-07-15 | HUM | 158.889999 |

+-------------+---------+-------------+--+

Block sampling

This type of sampling allows a query to randomly pick up n rows of data, n percentage of the data size, or n bytes of data. The sampling granularity is the HDFS block size.

hive> select tr_date, symbol, adjclose from stocks tablesample(3 rows);

+-------------+---------+------------+--+

| tr_date | symbol | adjclose |

+-------------+---------+------------+--+

| 2000-07-17 | XLNX | 74.269199 |

| 2000-07-17 | ES | 13.48614 |

| 2000-07-17 | CHK | 5.241649 |

+-------------+---------+------------+--+