Hive - CBO


Hive uses cost based optimization (CBO) of the query to speed up execution. CBO uses table statistics stored in hive metastore. These statistics are the number of rows, number of files, and raw data size of the objects in the database etc. Statistics are at the table, partition, and column level. Hive collects these statistics automatically or manually by running ANALYZE command.


To use CBO, set a few configurations

hive> SET hive.cbo.enable=true; -- default true
SET hive.compute.query.using.stats=true; -- default false
SET hive.stats.fetch.column.stats=true; -- default false
SET hive.stats.fetch.partition.stats=true; -- default true


Variations of the analyze commands.

hive> ANALYZE TABLE stocks COMPUTE STATISTICS;
ANALYZE TABLE stocks COMPUTE STATISTICS FOR COLUMNS; -- collect statistics for columns
ANALYZE TABLE stocks COMPUTE STATISTICS NOSCAN; -- collects only files count and their size, hence faster
ANALYZE TABLE stocks_bucket PARTITION(year) COMPUTE STATISTICS; -- collect statistics for all partitions
ANALYZE TABLE stocks_bucket COMPUTE STATISTICS FOR COLUMNS symbol, tr_date; 


The following statistics currently supported for table and partitions:

  • Number of rows
  • Number of files
  • Size in Bytes
  • Number of partition if the table is partitioned

Hive calculated the statistics autograph is enabled

hive> set hive.stats.autogather=true;


Compare explain plan for each of the above statement. The query below computes the year-on year avg adjclose price for each symbol.

hive>explain
    select
        * 
    from
        (select
            symbol,
            year(tr_date) year,
            avg(adjclose) avg_2013 
        from
            stocks 
        where
            year(tr_date) = 2013 
        group by
            symbol,
            year(tr_date)) t1 
    join
        (
            select
                symbol,
                year(tr_date) year,
                avg(adjclose) avg_2014 
            from
                stocks 
            where
                year(tr_date) = 2014 
            group by
                symbol,
                year(tr_date)
        ) t2 
            on t1.symbol = t2.symbol 
            and t1.year = t2.year - 1;


Query Plan

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2
  Stage-7 has a backup stage: Stage-2
  Stage-4 depends on stages: Stage-7
  Stage-8 has a backup stage: Stage-2
  Stage-5 depends on stages: Stage-8
  Stage-2
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-4, Stage-5, Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: stocks
            Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((year(tr_date) = 2013) and symbol is not null) (type: boolean)
              Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(adjclose)
                keys: symbol (type: string), year(tr_date) (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: _col1 is not null (type: boolean)
                  Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
                    Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: avg(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-6
    Conditional Operator

  Stage: Stage-7
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $INTNAME1 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $INTNAME1 
          TableScan
            HashTable Sink Operator
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-8
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $INTNAME 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $INTNAME 
          TableScan
            HashTable Sink Operator
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)

  Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 _col0 (type: string), _col1 (type: int)
                1 _col0 (type: string), (_col1 - 1) (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), _col1 (type: int)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
              Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col2 (type: double)
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), (_col1 - 1) (type: int)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string), (_col1 - 1) (type: int)
              Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col1 (type: int), _col2 (type: double)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: string), _col1 (type: int)
            1 _col0 (type: string), (_col1 - 1) (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 14840 Data size: 3086752 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: stocks
            Statistics: Num rows: 215855 Data size: 44898024 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((year(tr_date) = 2014) and symbol is not null) (type: boolean)
              Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(adjclose)
                keys: symbol (type: string), year(tr_date) (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 53964 Data size: 11224558 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (_col1 - 1) is not null (type: boolean)
                  Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: string), _col1 (type: int)
                    Statistics: Num rows: 26982 Data size: 5612279 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col2 (type: struct<count:bigint,sum:double,input:double>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: avg(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 13491 Data size: 2806139 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Also, try hive> explain extended select ... to see additional details.

Output of the above query:

A  2013 32.428159357142846 A 2014 39.7366169325397
AA 2013 8.299071845238094 AA 2014 14.112061321428575
AAL 2013 17.78869600000001 AAL 2014 38.132203436507936
AAP 2013 84.73887314285709 AAP 2014 130.51677267460315
AAPL 2013 63.2123629920635 AAPL 2014 88.46767818253969
ABBV 2013 39.25952140079365 ABBV 2014 51.575054202380976
ABC 2013 54.608562777777784 ABC 2014 72.72405008333328
ABT 2013 33.30880767460318 ABT 2014 39.041839111111116
ACN 2013 70.59546130952383 ACN 2014 77.84725737698413
ADBE 2013 46.803412603174586 ADBE 2014 67.45305560317458
ADI 2013 42.75313803571431 ADI 2014 48.544903773809516
ADM 2013 32.63645731746032 ADM 2014 43.55579893650796
ADP 2013 56.84975079761904 ADP 2014 68.99841694444443
ADS 2013 191.86746031746037 ADS 2014 265.4146823611112
ADSK 2013 39.15007928571429 ADSK 2014 53.830000071428586
AEE 2013 30.59784315873017 AEE 2014 36.93181792063489
AEP 2013 40.76954225396826 AEP 2014 48.832750234127005
AES 2013 11.827752503968249 AES 2014 13.292465710317478
AET 2013 57.41888040476189 AET 2014 76.42373734126986
AFL 2013 53.827321797619014 AFL 2014 58.39401914682539
AGN 2013 122.96440459126977 AGN 2014 220.02797637698416
AIG 2013 42.92983638492063 AIG 2014 51.194616666666654
AIV 2013 26.249909853174604 AIV 2014 30.359743226190485
AIZ 2013 48.386520777777754 AIZ 2014 63.50188482936509
AJG 2013 38.94890921428571 AJG 2014 43.30382427777779
...


Benefits of CBO are realised when you run complex join involving multiple tables of different size and using columns are various degrees of cardinality.

https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/