HBase Fundamentals
What is HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information scattered within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
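The sparse-data point can be pictured with a toy model: cells are stored only where values actually exist, so empty "columns" cost nothing. This is an illustrative sketch, not HBase client code:

```python
# Illustrative model (not HBase code): cells stored as
# {row_key: {"family:qualifier": value}} -- only non-empty cells
# take up space, which is why HBase handles sparse datasets well.
rows = {
    "user1": {"cf1:name": "alice"},
    "user2": {"cf1:name": "bob", "cf2:score": "42"},
}

# Most logical "columns" simply do not exist for most rows; absent
# cells have no storage cost.
total_cells = sum(len(cells) for cells in rows.values())
print(total_cells)  # 3 cells, regardless of how wide the logical schema is
```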
In the parlance of Eric Brewer's CAP theorem, HBase is a CP (consistent and partition-tolerant) system.
Common Use Cases of HBase
HBase is used as an operational data source for low-latency (<10 ms), high-throughput read and/or write workloads. It supports fast CRUD operations. It is also useful for storing high-velocity, highly available (HA) time-series data (streaming data).
NoSQL
Wide-column
Schemaless
Distributed
Strongly consistent
Highly scalable (petabyte scale)
Limitations
Does not support JOIN, ORDER BY, or GROUP BY queries
No secondary indexes
No support for foreign key constraints
When not to use HBase
Do not use HBase if you need transaction-level atomicity. For example, in eCommerce, placing an order could perform CRUD operations on the orders, order line items, and inventory tables in a single transaction. HBase supports only row-level atomicity. For transaction-level atomicity, use an RDBMS like Oracle or SQL Server (on premises), or AWS Aurora or Google Cloud SQL/Spanner (in the cloud).
Do not use for data smaller than 1 TB. Use relational databases like MySQL, Postgres, Oracle, etc.
Do not use if the primary workload is analytics-oriented. Instead, use a combination like Hadoop + Hive.
Do not use for documents or highly structured hierarchies. Use MongoDB, CouchDB, etc.
Do not use for blob storage where the typical object size is greater than 10 MB. Use object or distributed file storage like AWS S3 or Google Cloud Storage (cloud), or MapR-FS (on premises).
Check the HBase version
$ hbase version
Sanity check HBase services
Find the HBase service scripts
$ sudo ls -l /etc/init.d/hbase*
Check the status of the hbase-master and hbase-regionserver services
$ sudo service hbase-master status
$ sudo service hbase-regionserver status
If any of the services is not in a running state, restart it, for example:
$ sudo service hbase-master restart
$ sudo service hbase-regionserver restart
HBase installation directory: /usr/lib/hbase
$ ls -l /etc/hbase/conf/
-rw-r--r-- 1 root root 1811 Mar 23 11:29 hadoop-metrics2-hbase.properties
-rw-r--r-- 1 root root 4537 Mar 23 11:29 hbase-env.cmd
-rw-r--r-- 1 root root 7468 Mar 23 11:29 hbase-env.sh
-rw-r--r-- 1 root root 2257 Mar 23 11:29 hbase-policy.xml
-rw-rw-r-- 1 root root 1648 Apr 5 16:02 hbase-site.xml
-rw-r--r-- 1 root root 4339 Mar 23 11:29 log4j.properties
-rw-r--r-- 1 root root 10 Mar 23 11:29 regionservers
hbase-env.sh: environment variables, JVM properties, etc.
hbase-policy.xml: access policies for HBase services
hbase-site.xml: configuration of the HBase cluster
log4j.properties: logging output controls
regionservers: list of region servers allowed to connect to the HMaster
View hbase-site.xml to find
root HDFS directory
distribution mode
HBase Operations
Launch hbase shell
$ hbase shell
Command groups
hbase(main):002:0> help
HBase Shell, version 2.0.0.3.0.1.0-187, re9fcf450949102de5069b257a6dee469b8f5aab3, Wed Sep 19 10:16:35 UTC 2018
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.
COMMAND GROUPS:
Group name: general
Commands: processlist, status, table_help, version, whoami
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters
Group name: namespace
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml
Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
Group name: tools
Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, cleaner_chore_enabled, cleaner_chore_run, cleaner_chore_switch, clear_block_cache, clear_compaction_queues, clear_deadservers, close_region, compact, compact_rs, compaction_state, flush, is_in_maintenance_mode, list_deadservers, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, splitormerge_enabled, splitormerge_switch, trace, unassign, wal_roll, zk_dump
Group name: replication
Commands: add_peer, append_peer_namespaces, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replicated_tables, remove_peer, remove_peer_namespaces, remove_peer_tableCFs, set_peer_bandwidth, set_peer_exclude_namespaces, set_peer_exclude_tableCFs, set_peer_namespaces, set_peer_replicate_all, set_peer_tableCFs, show_peer_tableCFs, update_peer_config
Group name: snapshots
Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot
Group name: configuration
Commands: update_all_config, update_config
Group name: quotas
Commands: list_quota_snapshots, list_quota_table_sizes, list_quotas, list_snapshot_sizes, set_quota
Group name: security
Commands: grant, list_security_capabilities, revoke, user_permission
Group name: procedures
Commands: abort_procedure, list_locks, list_procedures
Group name: visibility labels
Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility
Group name: rsgroup
Commands: add_rsgroup, balance_rsgroup, get_rsgroup, get_server_rsgroup, get_table_rsgroup, list_rsgroups, move_namespaces_rsgroup, move_servers_namespaces_rsgroup, move_servers_rsgroup, move_servers_tables_rsgroup, move_tables_rsgroup, remove_rsgroup, remove_servers_rsgroup
Get help for a command
hbase(main):007:0> help 'list_namespace'
List all namespaces in hbase. Optional regular expression parameter could
be used to filter the output. Examples:
hbase> list_namespace
hbase> list_namespace 'abc.*'
Create a sample table
hbase> create 'sample', 'cf1', 'cf2'
Check the HDFS location for the table. The directory named with a 32-character hex string is the region name (an MD5 hash). It is likely to be different on your machine.
$ hadoop fs -ls /hbase/data/default/sample
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:01 /hbase/data/default/sample/.tabledesc
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:01 /hbase/data/default/sample/.tmp
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:01 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2
From the above directory structure, you can see that there is one region for this table. Each region is identified by a 32-character MD5-based region id.
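The region id is an MD5 hash rendered as 32 hexadecimal characters. The sketch below only illustrates what such an identifier looks like; the input string is a made-up placeholder, not HBase's actual encoding:

```python
import hashlib

# Hypothetical input: HBase derives the region name from fields such as
# the table name, start key, and a timestamp. The exact string below is
# illustrative only.
region_descriptor = b"sample,,1472238060000"
region_id = hashlib.md5(region_descriptor).hexdigest()

print(region_id)       # a 32-character hex string, like the directory above
print(len(region_id))  # 32
```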
You can also find the table in the HBase Master web UI at http://<hbase-master>:16010 (port 60010 in HBase releases before 1.0).
Use the describe command to view the column families and other information.
hbase> describe 'sample'
Table sample is ENABLED
sample
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'cf2', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
Check the folder structure inside region directory.
$ hadoop fs -ls /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2
-rw-r--r-- 1 hbase supergroup 41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
Inside the region directory, there is a separate folder for each column family.
If you look inside one of the column family directories, you will see there is no data yet. This is because we have not loaded any data.
$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r-- 1 hbase supergroup 41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r-- 1 hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid
Let's put some values into the sample table.
hbase> put 'sample', 'k1', 'cf1:c1', 'v1'
hbase> put 'sample', 'k1', 'cf2:c1', 'v2'
hbase> put 'sample', 'k2', 'cf1:c2', 'v3'
Now, verify the directories again.
$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r-- 1 hbase supergroup 41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r-- 1 hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid
Force a flush of all pending puts to store files (generally not required under normal operating conditions). Without a flush, HBase keeps the data in the memstore and waits until it reaches a size threshold (hbase.hregion.memstore.flush.size) before flushing to disk.
hbase> flush 'sample'
Now, notice that two new files have been created. These are called store files (HFiles).
$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r-- 1 hbase supergroup 41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r-- 1 hbase supergroup 1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r-- 1 hbase supergroup 1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r-- 1 hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid
For each column family, the new data is written to a separate file in HDFS.
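The memstore-and-flush behaviour demonstrated above can be sketched with a toy model (a simplification, not HBase internals): writes accumulate in memory and are written out as an immutable store file when a size threshold is reached or a flush is forced.

```python
# Toy model (assumption: simplified, not HBase internals) of how a
# memstore buffers writes and flushes them to immutable store files.
class MemStore:
    def __init__(self, flush_threshold=3):
        self.buffer = {}          # in-memory cells: (row, column) -> value
        self.store_files = []     # each flush produces one immutable "file"
        self.flush_threshold = flush_threshold

    def put(self, row, column, value):
        self.buffer[(row, column)] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.store_files.append(dict(self.buffer))  # write a snapshot out
            self.buffer.clear()

ms = MemStore(flush_threshold=3)
ms.put("k1", "cf1:c1", "v1")
ms.put("k1", "cf2:c1", "v2")     # still buffered in memory
ms.put("k2", "cf1:c2", "v3")     # threshold reached -> automatic flush
print(len(ms.store_files))       # 1 store file written
print(len(ms.buffer))            # 0 cells left in the memstore
```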
Put two more values: one with a new row key and one with an existing row key.
hbase> put 'sample', 'k2', 'cf1:c3', 'v4'
hbase> put 'sample', 'k3', 'cf1:c1', 'v5'
hbase> flush 'sample'
Now, check the HBase directory again.
$ hadoop fs -ls -R /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/
-rw-r--r-- 1 hbase supergroup 41 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.regioninfo
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/.tmp
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1
-rw-r--r-- 1 hbase supergroup 1043 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/587c011cf1454b5da510999e3f7d8b6a
-rw-r--r-- 1 hbase supergroup 1043 2016-08-26 12:21 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf1/9cff20f096c1487f8a10a70fe64dfce2
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2
-rw-r--r-- 1 hbase supergroup 1011 2016-08-26 12:16 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/cf2/16716b2167be401c99a18851d367472d
drwxr-xr-x - hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits
-rw-r--r-- 1 hbase supergroup 0 2016-08-26 12:04 /hbase/data/default/sample/37120ff814f8c05fe0e0015a6dadfec2/recovered.edits/2.seqid
A new store file has been created under column family cf1.
Add a new column family, cf3, to the table
hbase> alter 'sample', 'cf3'
Scan the table to view the rows
hbase> scan 'sample'
Count the number of rows in 'sample'
hbase> count 'sample'
A more performant version is shown below; it fetches 1000 rows at a time. Set CACHE lower if your rows are big. The default is to fetch one row at a time.
hbase> count 'sample', CACHE => 1000
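The effect of CACHE can be modeled as client-side batching: each server round trip fetches up to CACHE rows, so a larger value means far fewer RPCs. A sketch under that simplified model (not the actual HBase client):

```python
# Simplified model (not HBase client code): count rows by fetching them
# from an iterator in batches of `cache`, counting one "RPC" per batch.
def count_rows(row_iterator, cache=1000):
    total, rpc_calls = 0, 0
    it = iter(row_iterator)
    batch = True
    while batch:
        batch = []
        for _ in range(cache):            # one "RPC" fetches up to `cache` rows
            try:
                batch.append(next(it))
            except StopIteration:
                break
        if batch:
            rpc_calls += 1
            total += len(batch)
    return total, rpc_calls

rows = [f"row{i}" for i in range(2500)]
print(count_rows(rows, cache=1000))  # (2500, 3): three round trips
print(count_rows(rows, cache=1))     # (2500, 2500): one round trip per row
```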
Another way to count the number of rows in a table, especially useful for large tables, is the RowCounter MapReduce job.
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'table name'
2020-05-11 17:31:07,922 INFO [main] mapreduce.Job: Job job_1589211235667_0003 completed successfully
2020-05-11 17:31:08,171 INFO [main] mapreduce.Job: Counters: 46
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=274556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=215
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=178924
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=44731
Total vcore-milliseconds taken by all map tasks=44731
Total megabyte-milliseconds taken by all map tasks=45804544
Map-Reduce Framework
Map input records=2
Map output records=0
Input split bytes=215
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=393
CPU time spent (ms)=5660
Physical memory (bytes) snapshot=240848896
Virtual memory (bytes) snapshot=2881601536
Total committed heap usage (bytes)=138936320
Peak Map Physical memory (bytes)=240848896
Peak Map Virtual memory (bytes)=2881601536
HBase Counters
BYTES_IN_REMOTE_RESULTS=0
BYTES_IN_RESULTS=66
MILLIS_BETWEEN_NEXTS=4770
NOT_SERVING_REGION_EXCEPTION=0
NUM_SCANNER_RESTARTS=0
NUM_SCAN_RESULTS_STALE=0
REGIONS_SCANNED=1
REMOTE_RPC_CALLS=0
REMOTE_RPC_RETRIES=0
ROWS_FILTERED=0
ROWS_SCANNED=2
RPC_CALLS=1
RPC_RETRIES=0
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
ROWS=2
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Scan returns rows in ascending order of row key by default; use REVERSED to scan in descending order. [This feature is not supported in MapR-DB]
hbase> scan 'sample', {REVERSED => true}
Get the row for key 'k1'.
hbase> get 'sample', 'k1'
It will return all the column families. If you want to return only the columns in the cf1 column family, use the following statement.
hbase> get 'sample', 'k1', {COLUMNS => ['cf1']}
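Conceptually, restricting a get to one column family is a prefix match on the "family:qualifier" column names. An illustrative sketch (not HBase client code):

```python
# Illustrative model (not HBase client code): a `get` returns all cells
# for a row; restricting to one column family keeps only the cells whose
# column name starts with "family:".
row = {"cf1:c1": "v1", "cf2:c1": "v2"}   # cells for row key 'k1'

def get_family(row_cells, family):
    prefix = family + ":"
    return {col: val for col, val in row_cells.items() if col.startswith(prefix)}

print(get_family(row, "cf1"))  # {'cf1:c1': 'v1'}
```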
You can create a variable t for the table 'sample'.
hbase> t = get_table 'sample'
hbase> t.<press tab to view available functions>
hbase(main):037:0> t.Display all 239 possibilities? (y or n)
t.__id__ t.__send__ t._append_internal
t._count_internal t._createdelete_internal t._delete_internal
t._deleteall_internal t._deleterows_internal t._get_counter_internal
t._get_internal t._get_scanner t._get_splits_internal
t._hash_to_scan t._incr_internal t._put_internal
t._scan_internal t.abort_procedure t.add_labels
t.add_peer t.add_rsgroup t.alter
t.alter_async t.alter_namespace t.alter_status
t.append t.append_peer_namespaces t.append_peer_tableCFs
t.assign t.balance_rsgroup t.balance_switch
t.balancer t.balancer_enabled t.catalogjanitor_enabled
t.catalogjanitor_run t.catalogjanitor_switch t.class
t.cleaner_chore_enabled t.cleaner_chore_run t.cleaner_chore_switch
t.clear_auths t.clear_block_cache t.clear_compaction_queues
t.clear_deadservers t.clone t.clone_snapshot
t.close t.close_region t.com
t.compact t.compact_rs t.compaction_state
t.convert t.convert_bytes t.convert_bytes_with_position
t.count t.create t.create_namespace
t.debug t.debug? t.define_singleton_method
t.delete t.delete_all_snapshot t.delete_snapshot
t.delete_table_snapshots t.deleteall t.desc
t.describe t.describe_namespace t.disable
t.disable_all t.disable_peer t.disable_table_replication
t.display t.drop t.drop_all
t.drop_namespace t.dup t.enable
t.enable_all t.enable_peer t.enable_table_replication
t.enum_for t.eql? t.equal?
t.exists t.extend t.flush
t.freeze t.frozen? t.get
t.get_all_columns t.get_auths t.get_counter
t.get_peer_config t.get_rsgroup t.get_server_rsgroup
t.get_splits t.get_table t.get_table_rsgroup
t.grant t.handle_different_imports t.hash
t.help t.hlog_roll t.include_class
t.incr t.inspect t.instance_eval
t.instance_exec t.instance_of? t.instance_variable_defined?
t.instance_variable_get t.instance_variable_set t.instance_variables
t.is_a? t.is_disabled t.is_enabled
t.is_in_maintenance_mode t.is_meta_table? t.itself
t.java t.java_annotation t.java_field
t.java_implements t.java_kind_of? t.java_name
t.java_package t.java_require t.java_signature
t.javafx t.javax t.kind_of?
t.list t.list_deadservers t.list_labels
t.list_locks t.list_namespace t.list_namespace_tables
t.list_peer_configs t.list_peers t.list_procedures
t.list_quota_snapshots t.list_quota_table_sizes t.list_quotas
t.list_regions t.list_replicated_tables t.list_rsgroups
t.list_security_capabilities t.list_snapshot_sizes t.list_snapshots
t.list_table_snapshots t.locate_region t.major_compact
t.merge_region t.method t.methods
t.move t.move_namespaces_rsgroup t.move_servers_namespaces_rsgroup
t.move_servers_rsgroup t.move_servers_tables_rsgroup t.move_tables_rsgroup
t.name t.nil? t.normalize
t.normalizer_enabled t.normalizer_switch t.object_id
t.org t.parse_column_name t.private_methods
t.processlist t.protected_methods t.public_method
t.public_methods t.public_send t.put
t.remove_instance_variable t.remove_peer t.remove_peer_namespaces
t.remove_peer_tableCFs t.remove_rsgroup t.remove_servers_rsgroup
t.respond_to? t.restore_snapshot t.revoke
t.scan t.send t.set_attributes
t.set_authorizations t.set_auths t.set_cell_permissions
t.set_cell_visibility t.set_converter t.set_op_ttl
t.set_peer_bandwidth t.set_peer_exclude_namespaces t.set_peer_exclude_tableCFs
t.set_peer_namespaces t.set_peer_replicate_all t.set_peer_tableCFs
t.set_quota t.set_visibility t.show_filters
t.show_peer_tableCFs t.singleton_class t.singleton_methods
t.snapshot t.split t.splitormerge_enabled
t.splitormerge_switch t.status t.table
t.table_help t.taint t.tainted?
t.tap t.to_enum t.to_java
t.to_json t.to_s t.to_string
t.tools t.trace t.truncate
t.truncate_preserve t.trust t.unassign
t.untaint t.untrust t.untrusted?
t.update_all_config t.update_config t.update_peer_config
t.user_permission t.version t.wal_roll
t.whoami t.zk_dump
Scan table 'sample' for any cell value equal to v3 in any column
hbase> scan 'sample', FILTER => "ValueFilter(=, 'binary:v3')"
Scan table 'sample' to find rows where cf1:c2 equals v3
hbase> scan 'sample', COLUMNS => 'cf1:c2', FILTER => "SingleColumnValueFilter('cf1','c2',=, 'binary:v3')"
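The two filters differ in what they keep: ValueFilter keeps matching cells from any row, while SingleColumnValueFilter keeps or drops whole rows based on one column's value. A simplified model of those semantics (not the actual Java filter classes):

```python
# Simplified model (not the Java filter classes) of the two filter semantics.
table = {
    "k1": {"cf1:c1": "v1", "cf2:c1": "v2"},
    "k2": {"cf1:c2": "v3", "cf1:c3": "v4"},
    "k3": {"cf1:c1": "v5"},
}

def value_filter(tbl, value):
    # keep only the matching cells, from every row that has one
    return {k: {c: v for c, v in cells.items() if v == value}
            for k, cells in tbl.items() if value in cells.values()}

def single_column_value_filter(tbl, column, value):
    # keep whole rows where the given column equals the value
    return {k: cells for k, cells in tbl.items() if cells.get(column) == value}

print(value_filter(table, "v3"))                          # {'k2': {'cf1:c2': 'v3'}}
print(single_column_value_filter(table, "cf1:c2", "v3"))  # k2's full row survives
```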
At the end, disable and then drop the table from HBase.
hbase> disable 'sample'
hbase> drop 'sample'
Counters
hbase(main):001:0> create 'counters', 'daily', 'weekly', 'monthly'
0 row(s) in 1.1930 seconds
hbase(main):002:0> incr 'counters', '20110101', 'daily:hits', 1
COUNTER VALUE = 1
hbase(main):003:0> incr 'counters', '20110101', 'daily:hits', 1
COUNTER VALUE = 2
hbase(main):04:0> get_counter 'counters', '20110101', 'daily:hits'
COUNTER VALUE = 2
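Conceptually, incr performs an atomic server-side read-add-write on a numeric cell, so concurrent clients never lose updates the way a separate get followed by put could. A single-process sketch of the semantics (illustrative only):

```python
# Illustrative, single-process model of HBase counter semantics:
# `incr` adds to a cell's numeric value and returns the new value;
# `get_counter` reads it back. (Real HBase does this atomically
# server-side across many clients.)
counters = {}  # (row, column) -> integer value

def incr(row, column, amount=1):
    key = (row, column)
    counters[key] = counters.get(key, 0) + amount
    return counters[key]

def get_counter(row, column):
    return counters.get((row, column), 0)

print(incr("20110101", "daily:hits"))         # COUNTER VALUE = 1
print(incr("20110101", "daily:hits"))         # COUNTER VALUE = 2
print(get_counter("20110101", "daily:hits"))  # 2
```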
One more example
hbase> create_namespace "inventory"
hbase> create "inventory:product", "info"
hbase> alter "inventory:product", {NAME => "info", VERSIONS => 3}
hbase> alter "inventory:product", {NAME => "reviews", VERSIONS => 3}
hbase> put "inventory:product", "r1", "info:name", "Mac pro"
hbase> put "inventory:product", "r1", "info:cpu", "12"
hbase> put "inventory:product", "r1", "info:price", "1000"
hbase> put "inventory:product", "r1", "info:price", "1100"
hbase> put "inventory:product", "r1", "info:price", "1200"
hbase> get "inventory:product", "r1", {COLUMN => "info", VERSIONS => 3}
hbase> put "inventory:product", "r2", "info:name", "Samsung S9"
hbase> put "inventory:product", "r3", "info:name", "Microsoft Surface Pro"
hbase> scan "inventory:product",{COLUMNS => ["info:name"], STARTROW => "r2", ENDROW => "r4"}
hbase> put "inventory:product", "r3", "info:price", "800"
hbase> put "inventory:product", "r3", "reviews:c2", "4.5"
Query HBase table using Hive
Create a Hive table against the HBase table. Use the code snippet below and modify it based on your HBase table.
hive> CREATE EXTERNAL TABLE product(
key string,
name string,
price int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:price')
TBLPROPERTIES ('hbase.table.name' = 'inventory:product');
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> select * from product;
+--------------+------------------------+----------------+
| product.key | product.name | product.price |
+--------------+------------------------+----------------+
| r1 | Mac pro | 1200 |
| r2 | Samsung S9 | NULL |
| r3 | Microsoft Surface Pro | 800 |
+--------------+------------------------+----------------+
3 rows selected (4.521 seconds)
Let's dump the HBase table into HDFS as a Parquet file for further analysis.
hive> create table product_parquet(
key string,
name string,
price int
) stored as parquet;
hive> insert into product_parquet select * from product;
At the end, delete some data, then disable and drop the table from HBase.
hbase> delete "inventory:product", "r3", "reviews:c2" # delete a cell value
hbase> deleteall "inventory:product", "r3" # delete entire row
hbase> disable "inventory:product"
hbase> drop "inventory:product"
Note: deleting rows based on a row key range is not available in the hbase shell. Use the HBase API to delete in bulk.
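With the client API, a range delete is a scan over the key range feeding individual deletes. The sketch below models that logic in pure Python (a simplification, not the actual HBase client API):

```python
# Pure-Python model (not the HBase client API) of a range delete:
# "scan" row keys in [start_row, stop_row), then "delete" each match.
table = {"r1": "a", "r2": "b", "r3": "c", "r4": "d"}

def delete_range(tbl, start_row, stop_row):
    doomed = [k for k in tbl if start_row <= k < stop_row]  # the scan
    for k in doomed:                                        # the deletes
        del tbl[k]
    return doomed

print(delete_range(table, "r2", "r4"))  # ['r2', 'r3']
print(sorted(table))                    # ['r1', 'r4']
```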
Reference: https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad