Hive - CBO
Hive uses cost-based optimization (CBO) to speed up query execution. CBO relies on table statistics stored in the Hive metastore: the number of rows, the number of files, the raw data size of the objects in the database, and so on. Statistics exist at the table, partition, and column level. Hive collects them automatically, or you can collect them manually by running the ANALYZE command.
To use CBO, set a few configuration properties:
hive> SET hive.cbo.enable=true; -- default true
SET hive.compute.query.using.stats=true; -- default false
SET hive.stats.fetch.column.stats=true; -- default false
SET hive.stats.fetch.partition.stats=true; -- default true
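With hive.compute.query.using.stats enabled and statistics up to date, Hive can answer simple aggregates such as count(*), min, and max directly from metastore statistics instead of launching a job. A quick check (a sketch against the stocks table used later in this section):
hive> SELECT count(*) FROM stocks; -- answered from numRows in the metastore when stats are current
SELECT max(adjclose) FROM stocks; -- needs column statistics for adjclose; otherwise a normal job runs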
Variations of the ANALYZE command:
hive> ANALYZE TABLE stocks COMPUTE STATISTICS;
ANALYZE TABLE stocks COMPUTE STATISTICS FOR COLUMNS; -- collect statistics for columns
ANALYZE TABLE stocks COMPUTE STATISTICS NOSCAN; -- collects only the number of files and their size, hence faster
ANALYZE TABLE stocks_bucket PARTITION(year) COMPUTE STATISTICS; -- collect statistics for all partitions
ANALYZE TABLE stocks_bucket COMPUTE STATISTICS FOR COLUMNS symbol, tr_date;
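Once column statistics have been gathered, DESCRIBE FORMATTED with a column name shows what the metastore holds for that column (a quick sanity check, not part of the original example):
hive> DESCRIBE FORMATTED stocks symbol;
-- shows num_nulls, distinct_count, avg/max column length (and min/max for numeric columns) once FOR COLUMNS stats exist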
The following statistics are currently supported for tables and partitions (a quick way to view them follows the list):
Number of rows
Number of files
Size in Bytes
Number of partitions, if the table is partitioned
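These table-level numbers are stored as table (or partition) properties in the metastore and can be checked with DESCRIBE FORMATTED; the parameter names below are the standard keys Hive writes:
hive> DESCRIBE FORMATTED stocks;
-- look under Table Parameters for numFiles, numRows, rawDataSize and totalSize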
Hive gathers these statistics automatically when hive.stats.autogather is enabled:
hive> set hive.stats.autogather=true;
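With autogather on, the basic table statistics above are refreshed as a side effect of data-loading statements such as INSERT OVERWRITE; column statistics still need an explicit ANALYZE ... FOR COLUMNS (or hive.stats.column.autogather in newer Hive releases). A minimal sketch, assuming a hypothetical staging table stocks_staging:
hive> INSERT OVERWRITE TABLE stocks SELECT * FROM stocks_staging;
-- the stats task that follows the insert updates numFiles, numRows, rawDataSize and totalSize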
Compare the explain plan after running each of the statements above. The query below computes the year-on-year average adjclose price for each symbol.
hive> explain
select
  *
from
  (select
     symbol,
     year(tr_date) year,
     avg(adjclose) avg_2013
   from
     stocks
   where
     year(tr_date) = 2013
   group by
     symbol,
     year(tr_date)) t1
join
  (select
     symbol,
     year(tr_date) year,
     avg(adjclose) avg_2014
   from
     stocks
   where
     year(tr_date) = 2014
   group by
     symbol,
     year(tr_date)) t2
on t1.symbol = t2.symbol
and t1.year = t2.year - 1;
Query Plan
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2
  Stage-7 has a backup stage: Stage-2
  Stage-4 depends on stages: Stage-7
  Stage-8 has a backup stage: Stage-2
  Stage-5 depends on stages: Stage-8
  Stage-2
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-4, Stage-5, Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: stocks
            Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((year(tr_date) = 2013) and symbol is not null) (type: boolean)
              Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(adjclose)
                keys: symbol (type: string), year(tr_date) (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: _col1 is not null (type: boolean)
                  Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
                    Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: avg(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-6
    Conditional Operator

  Stage: Stage-7
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $INTNAME1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $INTNAME1
          TableScan
            HashTable Sink Operator
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-8
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $INTNAME
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $INTNAME
          TableScan
            HashTable Sink Operator
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)

  Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), _col1 (type: int)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
              Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col2 (type: double)
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), (_col1 - 1) (type: int)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string), (_col1 - 1) (type: int)
              Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col1 (type: int), _col2 (type: double)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: string), _col1 (type: int)
            1 _col0 (type: string), (_col1 - 1) (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: stocks
            Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((year(tr_date) = 2014) and symbol is not null) (type: boolean)
              Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(adjclose)
                keys: symbol (type: string), year(tr_date) (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (_col1 - 1) is not null (type: boolean)
                  Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
                    Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: avg(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
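Note that every operator in this plan reports Column stats: NONE, so the optimizer only had basic table statistics to work with and estimated the filter and join output sizes with rough heuristics (in this plan each predicate roughly halves the estimated row count). After collecting column statistics, the same EXPLAIN should report Column stats: COMPLETE and produce tighter row estimates, for example:
hive> ANALYZE TABLE stocks COMPUTE STATISTICS FOR COLUMNS symbol, tr_date, adjclose;
-- re-run EXPLAIN on the same query and compare the Statistics lines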
Also, try hive> explain extended select ... to see additional details.
Output of the above query:
A 2013 32.428159357142846 A 2014 39.7366169325397
AA 2013 8.299071845238094 AA 2014 14.112061321428575
AAL 2013 17.78869600000001 AAL 2014 38.132203436507936
AAP 2013 84.73887314285709 AAP 2014 130.51677267460315
AAPL 2013 63.2123629920635 AAPL 2014 88.46767818253969
ABBV 2013 39.25952140079365 ABBV 2014 51.575054202380976
ABC 2013 54.608562777777784 ABC 2014 72.72405008333328
ABT 2013 33.30880767460318 ABT 2014 39.041839111111116
ACN 2013 70.59546130952383 ACN 2014 77.84725737698413
ADBE 2013 46.803412603174586 ADBE 2014 67.45305560317458
ADI 2013 42.75313803571431 ADI 2014 48.544903773809516
ADM 2013 32.63645731746032 ADM 2014 43.55579893650796
ADP 2013 56.84975079761904 ADP 2014 68.99841694444443
ADS 2013 191.86746031746037 ADS 2014 265.4146823611112
ADSK 2013 39.15007928571429 ADSK 2014 53.830000071428586
AEE 2013 30.59784315873017 AEE 2014 36.93181792063489
AEP 2013 40.76954225396826 AEP 2014 48.832750234127005
AES 2013 11.827752503968249 AES 2014 13.292465710317478
AET 2013 57.41888040476189 AET 2014 76.42373734126986
AFL 2013 53.827321797619014 AFL 2014 58.39401914682539
AGN 2013 122.96440459126977 AGN 2014 220.02797637698416
AIG 2013 42.92983638492063 AIG 2014 51.194616666666654
AIV 2013 26.249909853174604 AIV 2014 30.359743226190485
AIZ 2013 48.386520777777754 AIZ 2014 63.50188482936509
AJG 2013 38.94890921428571 AJG 2014 43.30382427777779
...
Benefits of CBO are realised when you run complex joins involving multiple tables of different sizes, with join and filter columns of varying degrees of cardinality.
Reference: https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/