Hive File Format and Compression

Create Hive tables on the stocks data. Find a sample stocks data file here and load it into the HDFS location /user/cloudera/stocks.
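
For example, assuming the sample data file is saved locally as stocks.csv (a hypothetical filename), you could upload it like this:

$ hadoop fs -mkdir -p /user/cloudera/stocks
$ hadoop fs -put stocks.csv /user/cloudera/stocks/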

hive> create external table stocks(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/stocks'
TBLPROPERTIES("skip.header.line.count"="1");
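
As a quick sanity check, confirm that the table reads the loaded data:

hive> select * from stocks limit 5;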


Create a new table stocks_orc. Populate the table using data from the stocks table.

hive> create table stocks_orc(
  tr_date String,
  open Double,
  high Double,
  low Double,
  close Double,
  volume Double,
  adjClose Double,
  symbol String)
STORED AS ORC
tblproperties ("orc.compress"="ZLIB");

Load data into stocks_orc from stocks table.

hive> insert into stocks_orc select * from stocks;

Compare the size of the data between the original stocks data and the new ORC data. Open another Linux terminal and run the following commands.

$ hadoop fs -ls -h /user/cloudera/stocks
$ hadoop fs -ls -h /user/hive/warehouse/stocks_orc
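
For a single total per directory, the -du option of hadoop fs works as well:

$ hadoop fs -du -s -h /user/cloudera/stocks
$ hadoop fs -du -s -h /user/hive/warehouse/stocks_orc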

Similarly, you can create a table using Snappy compression as below. Since stocks_orc already exists, give this one a different name, such as stocks_orc_snappy.

hive> create table stocks_orc_snappy(
  tr_date String,
  open Double,
  high Double,
  low Double,
  close Double,
  volume Double,
  adjClose Double,
  symbol String)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
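
Populate it the same way and compare sizes again:

hive> insert into stocks_orc_snappy select * from stocks;
hive> dfs -ls -h /user/hive/warehouse/stocks_orc_snappy;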

Another way of creating a table with the same schema but a different file format is shown below.

hive> set parquet.compression=GZIP;
hive> create table stocks_parquet like stocks stored as parquet;
hive> insert into stocks_parquet select * from stocks;

Check size:

hive> dfs -ls -h /user/hive/warehouse/stocks_parquet;
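
If parquet-tools happens to be installed, you can also confirm the codec recorded in the file footer (the file name 000000_0 may differ on your system):

$ hadoop fs -get /user/hive/warehouse/stocks_parquet/000000_0
$ parquet-tools meta 000000_0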

Other valid values for parquet.compression are UNCOMPRESSED and SNAPPY.
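
For example, a Snappy variant would look like this (stocks_parquet_snappy is a hypothetical table name):

hive> set parquet.compression=SNAPPY;
hive> create table stocks_parquet_snappy like stocks stored as parquet;
hive> insert into stocks_parquet_snappy select * from stocks;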

Similarly, create a table in avro format and load the table.

hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=snappy; 
hive> create table stocks_avro like stocks stored as avro;
hive> insert into stocks_avro select * from stocks;
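
Check the size as before:

hive> dfs -ls -h /user/hive/warehouse/stocks_avro;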

You can also try bzip2 as the compression codec and compare the size of the data; the output is expected to be much smaller.
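
A minimal sketch of the bzip2 variant, assuming the Avro libraries on your cluster support the bzip2 codec (stocks_avro_bzip2 is a hypothetical table name):

hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=bzip2;
hive> create table stocks_avro_bzip2 like stocks stored as avro;
hive> insert into stocks_avro_bzip2 select * from stocks;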

View the Avro schema.

$ hadoop fs -get /user/hive/warehouse/stocks_avro/000000_0
$ avro-tools getschema 000000_0


If you do not have avro-tools installed, run the following commands

$ wget http://mirrors.fibergrid.in/apache/avro/avro-1.8.2/java/avro-tools-1.8.2.jar 
$ java -jar avro-tools-1.8.2.jar getschema 000000_0 

Inserting into HBase

Let's insert this dataset into HBase table.

Launch the HBase shell and create a table 'stocks' with 'info' as the column family.

$ hbase shell
hbase> create 'stocks', 'info'
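
You can verify the table with describe:

hbase> describe 'stocks'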

Create a corresponding table in Hive.

hive> CREATE EXTERNAL TABLE `stocks_hbase`(
`key` string,
`tr_date` string,
`open` string,
`high` string,
`low` string,
`close` string,
`volume` string,
`adjClose` string,
`symbol` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:tr_date,info:open,info:high,info:low,info:close,info:volume,info:adjClose,info:symbol')
TBLPROPERTIES ('hbase.table.name' = 'stocks');

Now insert values into the HBase table using Hive. Here the row key has been derived as a combination of symbol and date.

hive> insert into stocks_hbase select concat(symbol, "-", tr_date) key, * from stocks;

Run the following query in both Hive and Impala. You should see results.

hive> select * from stocks_hbase; 
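
Impala caches Hive metastore information, so if the new table is not visible there yet, refresh its metadata first:

impala> invalidate metadata stocks_hbase;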

Test the data from the HBase shell.

hbase(main):003:0> scan 'stocks', {LIMIT => 10}