Hive File Format and Compression

Create Hive tables on the stocks data. Find a sample stocks data file here and load it into the HDFS location /user/cloudera/stocks.
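
For example, assuming the sample data file is saved locally as stocks.csv (a hypothetical filename), you could upload it like this:

$ hadoop fs -mkdir -p /user/cloudera/stocks
$ hadoop fs -put stocks.csv /user/cloudera/stocks/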

hive> create external table stocks(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/stocks'
TBLPROPERTIES("skip.header.line.count"="1");
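
As a quick sanity check, confirm that the table reads the loaded data:

hive> select * from stocks limit 5;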


Create a new table stocks_orc. Populate the table using data from the stocks table.

hive> create table stocks_orc(
  tr_date String,
  open Double,
  high Double,
  low Double,
  close Double,
  volume Double,
  adjClose Double,
  symbol String)
STORED AS ORC
tblproperties ("orc.compress"="ZLIB");

Load data into stocks_orc from stocks table.

hive> insert into stocks_orc select * from stocks;

Compare the size of the data between the original stocks data and the new ORC data. Open another Linux terminal and run the following commands.

$ hadoop fs -ls -h /user/cloudera/stocks
$ hadoop fs -ls -h /user/hive/warehouse/stocks_orc
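
For a single total per directory, the -du option of hadoop fs works as well:

$ hadoop fs -du -s -h /user/cloudera/stocks
$ hadoop fs -du -s -h /user/hive/warehouse/stocks_orc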

Similarly, you can create a table using Snappy compression as below. Since stocks_orc already exists, give this one a different name, such as stocks_orc_snappy.

hive> create table stocks_orc_snappy(
  tr_date String,
  open Double,
  high Double,
  low Double,
  close Double,
  volume Double,
  adjClose Double,
  symbol String)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
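
Populate it the same way and compare sizes again:

hive> insert into stocks_orc_snappy select * from stocks;
hive> dfs -ls -h /user/hive/warehouse/stocks_orc_snappy;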

Another way of creating a table with the same schema but a different file format is shown below.

hive> set parquet.compression=GZIP;
hive> create table stocks_parquet like stocks stored as parquet;
hive> insert into stocks_parquet select * from stocks;

Check size:

hive> dfs -ls -h /user/hive/warehouse/stocks_parquet;
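
If parquet-tools happens to be installed, you can also confirm the codec recorded in the file footer (the file name 000000_0 may differ on your system):

$ hadoop fs -get /user/hive/warehouse/stocks_parquet/000000_0
$ parquet-tools meta 000000_0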

Other valid values for parquet.compression are UNCOMPRESSED and SNAPPY.
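
For example, a Snappy variant would look like this (stocks_parquet_snappy is a hypothetical table name):

hive> set parquet.compression=SNAPPY;
hive> create table stocks_parquet_snappy like stocks stored as parquet;
hive> insert into stocks_parquet_snappy select * from stocks;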

Similarly, create a table in avro format and load the table.

hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=snappy; 
hive> create table stocks_avro like stocks stored as avro;
hive> insert into stocks_avro select * from stocks;
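
Check the size as before:

hive> dfs -ls -h /user/hive/warehouse/stocks_avro;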

You can also try bzip2 as the compression codec and compare the size of the data; the output is expected to be much smaller.
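
A minimal sketch of the bzip2 variant, assuming the Avro libraries on your cluster support the bzip2 codec (stocks_avro_bzip2 is a hypothetical table name):

hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=bzip2;
hive> create table stocks_avro_bzip2 like stocks stored as avro;
hive> insert into stocks_avro_bzip2 select * from stocks;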

View the Avro schema.

$ hadoop fs -get /user/hive/warehouse/stocks_avro/000000_0
$ avro-tools getschema 000000_0


If you do not have avro-tools installed, run the following commands

$ wget http://mirrors.fibergrid.in/apache/avro/avro-1.8.2/java/avro-tools-1.8.2.jar 
$ java -jar avro-tools-1.8.2.jar getschema 000000_0 

Inserting into HBase

Let's insert this dataset into HBase table.

Launch the HBase shell and create a table 'stocks' with 'info' as the column family.

$ hbase shell
hbase> create 'stocks', 'info'
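
You can verify the table with describe:

hbase> describe 'stocks'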

Create a corresponding table in Hive.

hive> CREATE EXTERNAL TABLE `stocks_hbase`(
`key` string,
`tr_date` string,
`open` string,
`high` string,
`low` string,
`close` string,
`volume` string,
`adjClose` string,
`symbol` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:tr_date,info:open,info:high,info:low,info:close,info:volume,info:adjClose,info:symbol')
TBLPROPERTIES ('hbase.table.name' = 'stocks');

Now insert values into the HBase table using Hive. Here the row key has been derived as a combination of symbol and date.

hive> insert into stocks_hbase select concat(symbol, "-", tr_date) key, * from stocks;

Run the following query in both Hive and Impala. You should see results.

hive> select * from stocks_hbase; 
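
Impala caches Hive metastore information, so if the new table is not visible there yet, refresh its metadata first:

impala> invalidate metadata stocks_hbase;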

Test the data from the HBase shell.

hbase(main):003:0> scan 'stocks', {LIMIT => 10}