Hive Table - Indexing

Goal:

  • Use cases where indexing can help
  • Create indexing and measure performance

Step 1: First find some stat about the data. Here we will use access log information. Follow the blogs below to create the tables first.

Step 2:

SELECT histogram_numeric(CAST(size AS INT), 10) FROM accesslogs_partitioned;

[{"x":170096.430837976,"y":4122942.0},{"x":3668141.89473684,"y":38.0},{"x":1.3545640088829076E7,"y":4458.0},{"x":2.00982279322034E7,"y":708.0},{"x":2.7648847062949654E7,"y":556.0},{"x":4.013085605172412E7,"y":116.0},{"x":4.7630542858585835E7,"y":99.0},{"x":6.177313177941177E7,"y":68.0},{"x":7.09350905E7,"y":2.0},{"x":8.0215074E7,"y":1.0}]

Let's simplify the above data

Step 3:

CREATE TEMPORARY TABLE accesslogs_hist AS SELECT inline(histogram_numeric(CAST(size AS INT), 10)) FROM accesslogs_partitioned;

SELECT CAST(x AS BIGINT), CAST(y AS BIGINT) FROM accesslogs_hist;

+-----------+----------+--+

| x | y |

+-----------+----------+--+

| 170096 | 4122942 |

| 3668141 | 38 |

| 13545640 | 4458 |

| 20098227 | 708 |

| 27648847 | 556 |

| 40130856 | 116 |

| 47630542 | 99 |

| 61773131 | 68 |

| 70935090 | 2 |

| 80215074 | 1 |

+-----------+----------+--+