HBase Fundamentals

What is HBase?

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

In the parlance of Eric Brewer’s CAP Theorem, HBase is a CP (consistent and partition-tolerant) system.


Common Use Cases of HBase

HBase is used as an operational data store for low-latency (<10 ms), high-throughput read and/or write workloads. It supports fast CRUD operations. It is also useful for storing high-velocity, highly available (HA) time-series data (streaming data). Its key characteristics:

  • NoSQL
  • Wide-column
  • Schemaless
  • Distributed
  • Strongly consistent
  • Highly scalable (petabyte scale)


Limitations

  • Does not support JOIN, ORDER BY, or GROUP BY queries
  • No secondary indexes
  • No support for foreign key constraints

When not to use HBase

  • Do not use HBase if you need transaction-level atomicity. For example, in eCommerce, placing an order could perform CRUD operations on the orders, order line items, and inventory tables in a single transaction. HBase guarantees atomicity only at the row level. For transaction-level atomicity, use an RDBMS like Oracle or SQL Server (on premise), or AWS Aurora or Google Cloud SQL/Spanner (on cloud).
  • Do not use for data sets smaller than about 1 TB. Use relational databases like MySQL, Postgres, Oracle, etc.
  • Do not use if the primary workload is analytics oriented. Instead, use a combination like Hadoop + Hive.
  • Do not use for documents or highly structured hierarchies. Use MongoDB, CouchDB, etc.
  • Do not use for blob storage where the typical object size is > 10 MB. Use a store like AWS S3 or Google Cloud Storage (cloud), or MapR-FS (on premise).


Check hbase version

$ hbase version


Sanity check HBase services

Find the installed HBase services:

$ sudo ls -l /etc/init.d/hbase*

Check the status of the hbase-master and hbase-regionserver services:

$ sudo service hbase-master status
$ sudo service hbase-regionserver status

If any of the services is not in a running state, restart it, for example:

$ sudo service hbase-master restart
$ sudo service hbase-regionserver restart
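You can also verify that the daemon JVMs are up with jps (a quick check; this assumes a JDK is on the PATH and that running it via sudo lets jps see the hbase user's processes):

$ sudo jps | grep -Ei 'hmaster|hregionserver'

A healthy node should list HMaster and/or HRegionServer.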

The HBase installation directory is /usr/lib/hbase; the configuration files live in /etc/hbase/conf:

$ ls -l /etc/hbase/conf/
-rw-r--r-- 1 root root 1811 Mar 23 11:29 hadoop-metrics2-hbase.properties
-rw-r--r-- 1 root root 4537 Mar 23 11:29 hbase-env.cmd
-rw-r--r-- 1 root root 7468 Mar 23 11:29 hbase-env.sh
-rw-r--r-- 1 root root 2257 Mar 23 11:29 hbase-policy.xml
-rw-rw-r-- 1 root root 1648 Apr  5 16:02 hbase-site.xml
-rw-r--r-- 1 root root 4339 Mar 23 11:29 log4j.properties
-rw-r--r-- 1 root root   10 Mar 23 11:29 regionservers

hbase-env.sh: environment variables, JVM properties, etc.

hbase-policy.xml: access policy for hbase services

hbase-site.xml: configuration of hbase cluster

log4j.properties: logging output controls

regionservers: list of region servers allowed to connect to HMaster

View hbase-site.xml to find (a quick grep check follows this list):

  • the root HDFS directory (hbase.rootdir)
  • the distribution mode (hbase.cluster.distributed)
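For example, this grep pulls out both properties (assuming the usual layout where each <name> line is immediately followed by its <value> line):

$ grep -A1 -e 'hbase.rootdir' -e 'hbase.cluster.distributed' /etc/hbase/conf/hbase-site.xml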


HBase Operations

Launch hbase shell

$ hbase shell

Create a sample table

hbase> create 'sample', 'cf1', 'cf2' 
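You can verify that the table exists with the list command:

hbase> list 'sample'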

Check the HDFS location for the table. The long hexadecimal directory name is the region name (an MD5 hash); it is likely to be different on your machine.

$ hadoop fs -ls /hbase/data/default/sample
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/.tabledesc
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2

From the above directory structure, you can see that there is one region for this table. Each region is identified by a 32-character region id (a hex-encoded MD5 hash).

You can also find the table in the HBase Master web UI at http://<hbase-master>:60010

Use the describe command to view the column families and other information.

hbase> describe 'sample'
Table sample is ENABLED
sample         
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}                                                                                                          
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} 
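Note the VERSIONS => '1' attribute: only the latest version of each cell is kept. If you want HBase to retain more versions, alter the column family, for example:

hbase> alter 'sample', {NAME => 'cf1', VERSIONS => 3}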

Check the folder structure inside region directory.

$ hadoop fs -ls /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits

Inside the region, for each column family, there is a separate folder.

If you look inside one of the directories dedicated to a column family, you will see that there is no data inside. This is because we have not loaded any data yet.

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

Let's put some values into the sample table.

hbase> put 'sample', 'k1', 'cf1:c1', 'v1'
hbase> put 'sample', 'k1', 'cf2:c1', 'v2'
hbase> put 'sample', 'k2', 'cf1:c2', 'v3'
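Before checking HDFS, you can spot-check a single cell from the shell:

hbase> get 'sample', 'k1', 'cf1:c1'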

Now, verify the directories again. Notice that nothing has changed yet: the new cells are still buffered in the region server's memstore.

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

Force a flush of all the puts to files on disk (generally not required under normal operating conditions). Without a flush, HBase keeps the data in the memstore and writes it out to disk only after the memstore reaches a size threshold (the memstore flush size).

hbase> flush 'sample'

Now, notice that two new files have been created. These are called store files (HFiles).

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r--   1 hbase supergroup       1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

For each column family, the newly flushed data is written to its own file in HDFS.
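If you are curious about what is inside a store file, HBase ships an HFile tool that can print its key-value pairs. Substitute a store file path from your own listing:

$ hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a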

Put two more values: one with a new row key and one with an existing row key.

hbase> put 'sample', 'k2', 'cf1:c3', 'v4'
hbase> put 'sample', 'k3', 'cf1:c1', 'v5'
hbase> flush 'sample'

Now, check the HBase directory again:

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/9cff20f096c1487f8a10a70fe64dfce2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r--   1 hbase supergroup       1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

A new store file has been created under column family cf1. Note that cf2 did not get a new store file, because neither of the new puts wrote to it.
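These per-flush store files accumulate over time, and HBase merges them in the background via compaction. You can trigger a major compaction manually (normally not required); afterwards each column family is left with a single store file:

hbase> major_compact 'sample'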

Add a new column family, cf3, to the table.

hbase> alter 'sample', 'cf3'
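Run describe again to confirm that cf3 now appears in the column families description:

hbase> describe 'sample'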

Scan the table to view the rows

hbase> scan 'sample'

Count number of rows in 'sample'

hbase> count 'sample'

Another way to count the number of rows in a table is the RowCounter MapReduce job:

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'sample'

Scan returns rows in ascending order of row key. To return them in descending order, use the REVERSED option. [Reversed scan is not supported in MapR-DB]

hbase> scan 'sample', {REVERSED => true}
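Scans can also be restricted by row range and count, for example (STARTROW is inclusive):

hbase> scan 'sample', {STARTROW => 'k2', LIMIT => 1}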

Get the row for a given key, 'k1':

hbase> get 'sample', 'k1'

It will return columns from all the column families. If you want to return only the columns in the cf1 column family, use the following statement.

hbase> get 'sample', 'k1', {COLUMNS => ['cf1']}

You can create a variable t for the table 'sample'.

hbase> t = get_table 'sample'

hbase> t.<press tab to view available functions>
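The table variable supports the same operations as the table-name form, for example:

hbase> t.get 'k1'
hbase> t.scan
hbase> t.put 'k4', 'cf1:c1', 'v6'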

Scan table 'sample' for rows with a value equal to v3 in any column:

hbase> scan 'sample', FILTER => "ValueFilter(=, 'binary:v3')"

Scan table 'sample' to find rows where cf1:c2 equals v3:

hbase> scan 'sample', COLUMNS => 'cf1:c2', FILTER => "SingleColumnValueFilter('cf1','c2',=, 'binary:v3')"
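Row-key prefix scans are also available; for example, to match every row key beginning with 'k' (the ROWPREFIXFILTER shell option, available in recent HBase versions):

hbase> scan 'sample', {ROWPREFIXFILTER => 'k'}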


Query HBase table using Hive

Create a Hive external table against the HBase table. Use the code snippet below and modify it to match your HBase table.

hive> CREATE EXTERNAL TABLE product(
    key string,
    name string,
    price int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:name,info:price')
TBLPROPERTIES ('hbase.table.name' = 'product');
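Once the mapping exists, ordinary HiveQL reads go straight through to the live HBase table, for example:

hive> select * from product limit 10;

Keep in mind that every such query scans HBase itself, which is one reason to dump the data out for heavy analysis, as below.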

Let's dump the HBase table into HDFS as a Parquet file for further analysis.

hive> create table product_parquet(
    key string,
    name string,
    price int
) stored as parquet;

hive> insert into product_parquet select * from product;
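Analytical queries can now run against the Parquet copy without touching HBase, for example:

hive> select count(*) from product_parquet;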



At the end, disable and then drop the table from HBase.

hbase> disable 'sample'
hbase> drop 'sample'
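You can confirm that the table is gone:

hbase> exists 'sample'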