Hive - CBO


Hive uses cost based optimization (CBO) of the query to speed up execution. CBO uses table statistics stored in hive metastore. These statistics are the number of rows, number of files, and raw data size of the objects in the database etc. Statistics are at the table, partition, and column level. Hive collects these statistics automatically or manually by running ANALYZE command.


To use CBO, set a few configurations

hive> SET hive.cbo.enable=true; -- default true
SET hive.compute.query.using.stats=true; -- default false
SET hive.stats.fetch.column.stats=true; -- default false
SET hive.stats.fetch.partition.stats=true; -- default true


Variations of the analyze commands.

hive> ANALYZE TABLE stocks COMPUTE STATISTICS;
ANALYZE TABLE stocks COMPUTE STATISTICS NOSCAN; -- collects only files count and their size, hence faster
ANALYZE TABLE stocks_bucket PARTITION(year) COMPUTE STATISTICS; -- collect statistics for all partitions
ANALYZE TABLE stocks_bucket COMPUTE STATISTICS FOR COLUMNS symbol, tr_date; 


Compare explain plan for each of the above statement.

hive>explain select tr_date, symbol, adjclose from stocks_bucket where year = 2016 and symbol = 'AAPL';
+---------------------------------------------------+--+
|                      Explain                      |
+---------------------------------------------------+--+
| Plan optimized by CBO.                            |
|                                                   |
| Stage-0                                           |
|   Fetch Operator                                  |
|     limit:-1                                      |
|     Select Operator [SEL_2]                       |
|       Output:["_col0","_col1","_col2"]            |
|       Filter Operator [FIL_4]                     |
|         predicate:(symbol = 'AAPL')               |
|         TableScan [TS_0]                          |
|           Output:["tr_date","adjclose","symbol"]  |
|                                                   |
+---------------------------------------------------+--+

Benefits of CBO are realized when you run complex join involving multiple tables of different size and using columns are various degrees of cardinality.


https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/