Hive File Format and Compression
Create Hive tables on the stocks data. Find a sample stocks data file here and load it into the HDFS location /user/cloudera/stocks.
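For example, assuming the file was saved locally as stocks.csv (the file name is an assumption), it can be loaded as follows:
$ hadoop fs -mkdir -p /user/cloudera/stocks
$ hadoop fs -put stocks.csv /user/cloudera/stocks/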
hive> create external table stocks(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/stocks'
TBLPROPERTIES("skip.header.line.count"="1");
Create a new table stocks_orc. Populate the table using data from the stocks table.
hive> create table stocks_orc(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
STORED AS ORC
tblproperties ("orc.compress"="ZLIB");
Load data into stocks_orc from stocks table.
hive> insert into stocks_orc select * from stocks;
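To verify the copy, compare row counts between the two tables; the numbers should match:
hive> select count(*) from stocks;
hive> select count(*) from stocks_orc;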
Compare the size of the original stocks data with the new ORC data. Open another Linux terminal and run the following commands.
$ hadoop fs -ls -h /user/cloudera/stocks
$ hadoop fs -ls -h /user/hive/warehouse/stocks_orc
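You can also inspect the ORC file metadata, including the compression codec, with Hive's ORC file dump utility. The file name 000000_0 below is the typical name of a single insert's output and is an assumption; check the directory listing for the actual name.
$ hive --orcfiledump /user/hive/warehouse/stocks_orc/000000_0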
Similarly, you can create a table using Snappy compression as below. Since stocks_orc already exists, give the new table a different name, for example stocks_orc_snappy.
hive> create table stocks_orc_snappy(
tr_date String,
open Double,
high Double,
low Double,
close Double,
volume Double,
adjClose Double,
symbol String)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
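As before, populate the new table from stocks and compare sizes (assuming the table was named stocks_orc_snappy as above):
hive> insert into stocks_orc_snappy select * from stocks;
$ hadoop fs -ls -h /user/hive/warehouse/stocks_orc_snappy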
Another way of creating a table with the same schema but a different file format is as below.
hive> set parquet.compression=GZIP;
hive> create table stocks_parquet like stocks stored as parquet;
hive> insert into stocks_parquet select * from stocks;
Check size:
hive> dfs -ls -h /user/hive/warehouse/stocks_parquet;
Other valid values for the compression codec are UNCOMPRESSED and SNAPPY.
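You can confirm the storage format and properties of the new table from its metadata:
hive> describe formatted stocks_parquet;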
Similarly, create a table in Avro format and load the table.
hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=snappy;
hive> create table stocks_avro like stocks stored as avro;
hive> insert into stocks_avro select * from stocks;
You can also try bzip2 as the compression codec and compare the size of the data; it is expected to be much smaller. A minimal sketch follows.
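Assuming a new table named stocks_avro_bzip2 (the name is hypothetical):
hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=bzip2;
hive> create table stocks_avro_bzip2 like stocks stored as avro;
hive> insert into stocks_avro_bzip2 select * from stocks;
hive> dfs -ls -h /user/hive/warehouse/stocks_avro_bzip2;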
View the Avro schema:
$ hadoop fs -get /user/hive/warehouse/stocks_avro/000000_0
$ avro-tools getschema 000000_0
If you do not have avro-tools installed, run the following commands
$ wget http://mirrors.fibergrid.in/apache/avro/avro-1.8.2/java/avro-tools-1.8.2.jar
$ java -jar avro-tools-1.8.2.jar getschema 000000_0
Inserting into HBase
Let's insert this dataset into HBase table.
Launch the HBase shell and create a table 'stocks' with 'info' as the column family.
$ hbase shell
hbase> create 'stocks', 'info'
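You can verify the table and its column family:
hbase> describe 'stocks'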
Create the corresponding table in Hive:
hive> CREATE EXTERNAL TABLE `stocks_hbase`(
`key` string,
`tr_date` string,
`open` string,
`high` string,
`low` string,
`close` string,
`volume` string,
`adjClose` string,
`symbol` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:tr_date,info:open,info:high,info:low,info:close,info:volume,info:adjClose,info:symbol')
TBLPROPERTIES ('hbase.table.name' = 'stocks');
Now insert values into the HBase table using Hive. Here the row key has been derived as a combination of symbol and date.
hive> insert into stocks_hbase select concat(symbol, "-", tr_date) key, * from stocks;
Run the following query in both Hive and Impala. You should see results in both (see the Impala note below the query).
hive> select * from stocks_hbase;
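Note that Impala does not automatically see tables created through Hive; refreshing the catalog in impala-shell first is usually required. A minimal sketch:
$ impala-shell
> invalidate metadata stocks_hbase;
> select * from stocks_hbase limit 10;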
Test the data from the HBase shell:
hbase(main):003:0> scan 'stocks', {LIMIT => 10}
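You can also fetch a single row by its key. The exact key value depends on the date format in your data, so the key below is only an illustrative assumption:
hbase(main):004:0> get 'stocks', 'AAPL-2015-01-02'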