Hive - CBO

Hive uses cost based optimization (CBO) of the query to speed up execution. CBO uses table statistics stored in hive metastore. These statistics are the number of rows, number of files, and raw data size of the objects in the database etc. Statistics are at the table, partition, and column level. Hive collects these statistics automatically or manually by running ANALYZE command.

To use CBO, set a few configurations

hive> SET hive.cbo.enable=true; -- default true

SET hive.compute.query.using.stats=true; -- default false

SET hive.stats.fetch.column.stats=true; -- default false

SET hive.stats.fetch.partition.stats=true; -- default true

Variations of the analyze commands.

hive> ANALYZE TABLE stocks COMPUTE STATISTICS;

ANALYZE TABLE stocks COMPUTE STATISTICS FOR COLUMNS; -- collect statistics for columns

ANALYZE TABLE stocks COMPUTE STATISTICS NOSCAN; -- collects only files count and their size, hence faster

ANALYZE TABLE stocks_bucket PARTITION(year) COMPUTE STATISTICS; -- collect statistics for all partitions

ANALYZE TABLE stocks_bucket COMPUTE STATISTICS FOR COLUMNS symbol, tr_date;

The following statistics currently supported for table and partitions:

Number of rows
Number of files
Size in Bytes
Number of partition if the table is partitioned

Hive calculated the statistics autograph is enabled

hive> set hive.stats.autogather=true;

Compare explain plan for each of the above statement. The query below computes the year-on year avg adjclose price for each symbol.

hive>explain

select

from

(select

symbol,

year(tr_date) year,

avg(adjclose) avg_2013

from

stocks

where

year(tr_date) = 2013

group by

symbol,

year(tr_date)) t1

join

(

select

symbol,

year(tr_date) year,

avg(adjclose) avg_2014

from

stocks

where

year(tr_date) = 2014

group by

symbol,

year(tr_date)

) t2

on t1.symbol = t2.symbol

and t1.year = t2.year - 1;

Query Plan

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2

Stage-7 has a backup stage: Stage-2

Stage-4 depends on stages: Stage-7

Stage-8 has a backup stage: Stage-2

Stage-5 depends on stages: Stage-8

Stage-2

Stage-3 is a root stage

Stage-0 depends on stages: Stage-4, Stage-5, Stage-2

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

alias: stocks

Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: ((year(tr_date) = 2013) and symbol is not null) (type: boolean)

Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE

Group By Operator

aggregations: avg(adjclose)

keys: symbol (type: string), year(tr_date) (type: int)

mode: hash

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: _col1 is not null (type: boolean)

Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string), _col1 (type: int)

sort order: ++

Map-reduce partition columns: _col0 (type: string), _col1 (type: int)

Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE

value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)

Reduce Operator Tree:

Group By Operator

aggregations: avg(VALUE._col0)

keys: KEY._col0 (type: string), KEY._col1 (type: int)

mode: mergepartial

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-6

Conditional Operator

Stage: Stage-7

Map Reduce Local Work

Alias -> Map Local Tables:

$INTNAME1

Fetch Operator

limit: -1

Alias -> Map Local Operator Tree:

$INTNAME1

TableScan

HashTable Sink Operator

keys:

0 _col0 (type: string), _col1 (type: int)

1 _col0 (type: string), (_col1 - 1) (type: int)

Stage: Stage-4

Map Reduce

Map Operator Tree:

TableScan

Map Join Operator

condition map:

Inner Join 0 to 1

keys:

0 _col0 (type: string), _col1 (type: int)

1 _col0 (type: string), (_col1 - 1) (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Local Work:

Map Reduce Local Work

Stage: Stage-8

Map Reduce Local Work

Alias -> Map Local Tables:

$INTNAME

Fetch Operator

limit: -1

Alias -> Map Local Operator Tree:

$INTNAME

TableScan

HashTable Sink Operator

keys:

0 _col0 (type: string), _col1 (type: int)

1 _col0 (type: string), (_col1 - 1) (type: int)

Stage: Stage-5

Map Reduce

Map Operator Tree:

TableScan

Map Join Operator

condition map:

Inner Join 0 to 1

keys:

0 _col0 (type: string), _col1 (type: int)

1 _col0 (type: string), (_col1 - 1) (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Local Work:

Map Reduce Local Work

Stage: Stage-2

Map Reduce

Map Operator Tree:

TableScan

Reduce Output Operator

key expressions: _col0 (type: string), _col1 (type: int)

sort order: ++

Map-reduce partition columns: _col0 (type: string), _col1 (type: int)

Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE

value expressions: _col2 (type: double)

TableScan

Reduce Output Operator

key expressions: _col0 (type: string), (_col1 - 1) (type: int)

sort order: ++

Map-reduce partition columns: _col0 (type: string), (_col1 - 1) (type: int)

Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE

value expressions: _col1 (type: int), _col2 (type: double)

Reduce Operator Tree:

Join Operator

condition map:

Inner Join 0 to 1

keys:

0 _col0 (type: string), _col1 (type: int)

1 _col0 (type: string), (_col1 - 1) (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-3

Map Reduce

Map Operator Tree:

TableScan

alias: stocks

Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: ((year(tr_date) = 2014) and symbol is not null) (type: boolean)

Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE

Group By Operator

aggregations: avg(adjclose)

keys: symbol (type: string), year(tr_date) (type: int)

mode: hash

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: (_col1 - 1) is not null (type: boolean)

Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: string), _col1 (type: int)

sort order: ++

Map-reduce partition columns: _col0 (type: string), _col1 (type: int)

Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE

value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)

Reduce Operator Tree:

Group By Operator

aggregations: avg(VALUE._col0)

keys: KEY._col0 (type: string), KEY._col1 (type: int)

mode: mergepartial

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

ListSink

Also, try hive> explain extended select ... to see additional details.

Output of the above query:

A 2013 32.428159357142846 A 2014 39.7366169325397

AA 2013 8.299071845238094 AA 2014 14.112061321428575

AAL 2013 17.78869600000001 AAL 2014 38.132203436507936

AAP 2013 84.73887314285709 AAP 2014 130.51677267460315

AAPL 2013 63.2123629920635 AAPL 2014 88.46767818253969

ABBV 2013 39.25952140079365 ABBV 2014 51.575054202380976

ABC 2013 54.608562777777784 ABC 2014 72.72405008333328

ABT 2013 33.30880767460318 ABT 2014 39.041839111111116

ACN 2013 70.59546130952383 ACN 2014 77.84725737698413

ADBE 2013 46.803412603174586 ADBE 2014 67.45305560317458

ADI 2013 42.75313803571431 ADI 2014 48.544903773809516

ADM 2013 32.63645731746032 ADM 2014 43.55579893650796

ADP 2013 56.84975079761904 ADP 2014 68.99841694444443

ADS 2013 191.86746031746037 ADS 2014 265.4146823611112

ADSK 2013 39.15007928571429 ADSK 2014 53.830000071428586

AEE 2013 30.59784315873017 AEE 2014 36.93181792063489

AEP 2013 40.76954225396826 AEP 2014 48.832750234127005

AES 2013 11.827752503968249 AES 2014 13.292465710317478

AET 2013 57.41888040476189 AET 2014 76.42373734126986

AFL 2013 53.827321797619014 AFL 2014 58.39401914682539

AGN 2013 122.96440459126977 AGN 2014 220.02797637698416

AIG 2013 42.92983638492063 AIG 2014 51.194616666666654

AIV 2013 26.249909853174604 AIV 2014 30.359743226190485

AIZ 2013 48.386520777777754 AIZ 2014 63.50188482936509

AJG 2013 38.94890921428571 AJG 2014 43.30382427777779

...

Benefits of CBO are realised when you run complex join involving multiple tables of different size and using columns are various degrees of cardinality.

https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/