HBase Fundamentals

What is HBase?

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

In the parlance of Eric Brewer’s CAP Theorem, HBase is a CP (consistent and partition-tolerant) system.


Common Use Cases of HBase

HBase is used as an operational data store for low-latency (<10 ms), high-throughput read and/or write workloads. It supports fast CRUD operations. It is also useful for storing high-velocity, highly available (HA) time-series data (streaming data). Its key characteristics:

  • NoSQL
  • Wide-column
  • Schemaless
  • Distributed
  • Strongly consistent
  • Highly scalable (petabyte scale)


Limitations

  • Does not support JOIN, ORDER BY, or GROUP BY queries
  • No secondary indexes
  • No support for foreign key constraints

When not to use HBase

  • Do not use HBase if you need transaction-level atomicity. For example, in eCommerce, placing an order could perform CRUD operations on the orders, order line items, and inventory tables in a single transaction. HBase guarantees atomicity only at the row level. For transaction-level atomicity, use an RDBMS like Oracle or SQL Server (on premise), or AWS Aurora or Google Cloud SQL/Spanner (on cloud).
  • Do not use for data sets smaller than about 1 TB. Use relational databases like MySQL, Postgres, Oracle, etc.
  • Do not use if the primary workload is analytics oriented. Instead, use a combination like Hadoop + Hive.
  • Do not use for documents or highly structured hierarchies. Use MongoDB, CouchDB, etc.
  • Do not use for blob storage where the typical object size is > 10 MB. Use a store like AWS S3 or Google Cloud Storage (cloud), or MapR-FS (on premise).


Check hbase version

$ hbase version


Sanity check HBase services

Find the installed HBase services:

$ sudo ls -l /etc/init.d/hbase*

Check the status of the hbase-master and hbase-regionserver services:

$ sudo service hbase-master status
$ sudo service hbase-regionserver status

If any of the services is not in a running state, restart it, for example:

$ sudo service hbase-master restart
$ sudo service hbase-regionserver restart
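You can also verify that the daemon JVMs are up with jps (a quick check; this assumes a JDK is on the PATH and that running it via sudo lets jps see the hbase user's processes):

$ sudo jps | grep -Ei 'hmaster|hregionserver'

A healthy node should list HMaster and/or HRegionServer.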

The HBase installation directory is /usr/lib/hbase; the configuration files live in /etc/hbase/conf:

$ ls -l /etc/hbase/conf/
-rw-r--r-- 1 root root 1811 Mar 23 11:29 hadoop-metrics2-hbase.properties
-rw-r--r-- 1 root root 4537 Mar 23 11:29 hbase-env.cmd
-rw-r--r-- 1 root root 7468 Mar 23 11:29 hbase-env.sh
-rw-r--r-- 1 root root 2257 Mar 23 11:29 hbase-policy.xml
-rw-rw-r-- 1 root root 1648 Apr  5 16:02 hbase-site.xml
-rw-r--r-- 1 root root 4339 Mar 23 11:29 log4j.properties
-rw-r--r-- 1 root root   10 Mar 23 11:29 regionservers

hbase-env.sh: environment variables, JVM properties, etc.

hbase-policy.xml: access policy for hbase services

hbase-site.xml: configuration of hbase cluster

log4j.properties: logging output controls

regionservers: list of region servers allowed to connect to HMaster

View hbase-site.xml to find (a quick grep check follows this list):

  • the root HDFS directory (hbase.rootdir)
  • the distribution mode (hbase.cluster.distributed)
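For example, this grep pulls out both properties (assuming the usual layout where each <name> line is immediately followed by its <value> line):

$ grep -A1 -e 'hbase.rootdir' -e 'hbase.cluster.distributed' /etc/hbase/conf/hbase-site.xml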


HBase Operations

Launch hbase shell

$ hbase shell

Create a sample table

hbase> create 'sample', 'cf1', 'cf2' 
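You can verify that the table exists with the list command:

hbase> list 'sample'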

Check the HDFS location for the table. The long hexadecimal directory name is the region name (an MD5 hash); it is likely to be different on your machine.

$ hadoop fs -ls /hbase/data/default/sample
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/.tabledesc
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:01 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2

From the above directory structure, you can see that there is one region for this table. Each region is identified by a 32-character region id (a hex-encoded MD5 hash).

You can also find the table in the HBase Master web UI at http://<hbase-master>:60010

Use the describe command to view the column families and other information.

hbase> describe 'sample'
Table sample is ENABLED
sample         
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}                                                                                                          
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} 
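Note the VERSIONS => '1' attribute: only the latest version of each cell is kept. If you want HBase to retain more versions, alter the column family, for example:

hbase> alter 'sample', {NAME => 'cf1', VERSIONS => 3}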

Check the folder structure inside region directory.

$ hadoop fs -ls /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits

Inside the region, for each column family, there is a separate folder.

If you look inside one of the directories dedicated to a column family, you will see that there is no data inside. This is because we have not loaded any data yet.

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

Let's put some values into the sample table.

hbase> put 'sample', 'k1', 'cf1:c1', 'v1'
hbase> put 'sample', 'k1', 'cf2:c1', 'v2'
hbase> put 'sample', 'k2', 'cf1:c2', 'v3'
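Before checking HDFS, you can spot-check a single cell from the shell:

hbase> get 'sample', 'k1', 'cf1:c1'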

Now, verify the directories again. Notice that nothing has changed yet: the new cells are still buffered in the region server's memstore.

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

Force a flush of all the puts to files on disk (generally not required under normal operating conditions). Without a flush, HBase keeps the data in the memstore and writes it out to disk only after the memstore reaches a size threshold (the memstore flush size).

hbase> flush 'sample'

Now, notice that two new files have been created. These are called store files (HFiles).

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r--   1 hbase supergroup       1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

For each column family, the newly flushed data is written to its own file in HDFS.
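If you are curious about what is inside a store file, HBase ships an HFile tool that can print its key-value pairs. Substitute a store file path from your own listing:

$ hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a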

Put two more values: one with a new row key and one with an existing row key.

hbase> put 'sample', 'k2', 'cf1:c3', 'v4'
hbase> put 'sample', 'k3', 'cf1:c1', 'v5'
hbase> flush 'sample'

Now, check the HBase directory again:

$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r--   1 hbase supergroup         41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
-rw-r--r--   1 hbase supergroup       1043 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/9cff20f096c1487f8a10a70fe64dfce2
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r--   1 hbase supergroup       1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x   - hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r--   1 hbase supergroup          0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid

A new store file has been created under column family cf1. Note that cf2 did not get a new store file, because neither of the new puts wrote to it.
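These per-flush store files accumulate over time, and HBase merges them in the background via compaction. You can trigger a major compaction manually (normally not required); afterwards each column family is left with a single store file:

hbase> major_compact 'sample'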

Add a new column family, cf3, to the table.

hbase> alter 'sample', 'cf3'
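Run describe again to confirm that cf3 now appears in the column families description:

hbase> describe 'sample'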

Scan the table to view the rows

hbase> scan 'sample'

Count number of rows in 'sample'

hbase> count 'sample'

Another way to count the number of rows in a table is the RowCounter MapReduce job:

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'sample'

Scan returns rows in ascending order of row key. To return them in descending order, use the REVERSED option. [Reversed scan is not supported in MapR-DB]

hbase> scan 'sample', {REVERSED => true}
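Scans can also be restricted by row range and count, for example (STARTROW is inclusive):

hbase> scan 'sample', {STARTROW => 'k2', LIMIT => 1}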

Get the row for a given key, 'k1':

hbase> get 'sample', 'k1'

It will return columns from all the column families. If you want to return only the columns in the cf1 column family, use the following statement.

hbase> get 'sample', 'k1', {COLUMNS => ['cf1']}

You can create a variable t for the table 'sample'.

hbase> t = get_table 'sample'

hbase> t.<press tab to view available functions>
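The table variable supports the same operations as the table-name form, for example:

hbase> t.get 'k1'
hbase> t.scan
hbase> t.put 'k4', 'cf1:c1', 'v6'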

Scan table 'sample' for rows with a value equal to v3 in any column:

hbase> scan 'sample', FILTER => "ValueFilter(=, 'binary:v3')"

Scan table 'sample' to find rows where cf1:c2 equals v3:

hbase> scan 'sample', COLUMNS => 'cf1:c2', FILTER => "SingleColumnValueFilter('cf1','c2',=, 'binary:v3')"
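Row-key prefix scans are also available; for example, to match every row key beginning with 'k' (the ROWPREFIXFILTER shell option, available in recent HBase versions):

hbase> scan 'sample', {ROWPREFIXFILTER => 'k'}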


Query HBase table using Hive

Create a Hive external table against the HBase table. Use the code snippet below and modify it to match your HBase table.

hive> CREATE EXTERNAL TABLE product(
    key string,
    name string,
    price int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,info:name,info:price')
TBLPROPERTIES ('hbase.table.name' = 'product');
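Once the mapping exists, ordinary HiveQL reads go straight through to the live HBase table, for example:

hive> select * from product limit 10;

Keep in mind that every such query scans HBase itself, which is one reason to dump the data out for heavy analysis, as below.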

Let's dump the HBase table into HDFS as a Parquet file for further analysis.

hive> create table product_parquet(
    key string,
    name string,
    price int
) stored as parquet;

hive> insert into product_parquet select * from product;
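Analytical queries can now run against the Parquet copy without touching HBase, for example:

hive> select count(*) from product_parquet;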



At the end, disable and then drop the table from HBase.

hbase> disable 'sample'
hbase> drop 'sample'
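You can confirm that the table is gone:

hbase> exists 'sample'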