HAdmin - HBase UI. Articles about HBase


Introduction to HBase

Igor Skokov
February 9, 2019


This article starts a blog dedicated to the overall capabilities and use cases of HBase.
It explains how HBase works, how it stores your data, and when you can use it. This post is not a deep dive into HBase internals; it is aimed mostly at HBase “beginners” and tries to help readers understand whether HBase is suitable for their task.

Let’s see what you can learn from this blog post:

Each section contains links to various HBase-related resources which describe one feature or another in more detail.

Part 1

This chapter is dedicated to a general description of HBase, the available client features, and API examples.

What is HBase

Let’s start with a little history. Back in 2006, Google published a paper describing its internal distributed storage system, called Bigtable. It was designed to store petabytes of data across thousands of servers. A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
HBase is a NoSQL database based on the Google Bigtable architecture. HBase is really more a “Data Store” than a “Database” because it lacks many of the features you find in an RDBMS, such as typed columns (all data is uninterpreted raw bytes for HBase), secondary indexes, triggers, and advanced query languages (though most of these features are covered by Apache Phoenix, which works on top of HBase).
However, HBase has an advantage over a classic RDBMS: linear scaling. An RDBMS can scale well, but only up to a point (specifically, the size of a single database server), and for the best performance it requires specialized hardware and storage devices. HBase clusters are expanded by adding RegionServers hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it increases in terms of both storage and processing capacity.
The most notable HBase features are:

Data model

The high-level representation of data stored in HBase is similar to that of relational databases. It has the notions of tables, namespaces (i.e. databases), and columns.
Data is stored in tables, which consist of rows with columns that hold the actual values. Tables can be logically grouped into namespaces, which can be thought of as databases. Namespaces are primarily used to apply resource quotas or common security settings to a group of tables.
Although all of this looks very similar to a relational DBMS (RDBMS), HBase has little in common with one. The difference begins with how HBase stores and, consequently, accesses data. Unlike in an RDBMS, tables in HBase have no schema and contain rows identified by a unique key (like a primary key in an RDBMS). Each row can contain an arbitrary number of columns holding binary data (HBase doesn’t interpret the data inside a column and treats it as a raw byte array). Each column has a unique identifier made up of a column family and a column qualifier.

A column family (CF) represents a group of columns which are usually accessed together (in the same request) and/or share storage properties such as compression, data encoding, caching, etc. Physically, CF data is stored in files which contain only the columns belonging to this CF, which is why accessing columns in the same CF is very efficient (HBase scans only a subset of the table files to find the required columns). A table can contain multiple column families, but in practice 2-3 CFs is a reasonable number (more CFs can affect performance, see docs).

HBase can store multiple versions of a column (optionally with a TTL). Each version is identified by a timestamp. The timestamp can be set in the write request to HBase. By default, HBase sets the timestamp on the server to the current epoch time in milliseconds.
The number of versions stored by HBase can be set per column family.
The concept of versions looks very simple at first, but it can be less obvious in some corner cases (more in the following sections).

The combination of row key, column family, column qualifier and version is called a cell.
A cell contains the actual data stored as raw binary (a byte array). Cell content is not interpreted by HBase, except for the special case of atomic counters (see docs for more info).
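
To make the “sorted multidimensional map” idea more tangible, here is a minimal sketch in plain Java that models a table as nested sorted maps. It is only a toy model of the logical view, not how HBase is implemented; the class name CellMapSketch and the “family:qualifier” string key are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of the logical data model; HBase itself is not implemented this way. */
final class CellMapSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> raw value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    /** Stores one cell: the value addressed by [row, family, qualifier, timestamp]. */
    void put(String row, String family, String qualifier, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier,
                             c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the version with the greatest timestamp, or null if the column is absent. */
    byte[] getLatest(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, byte[]> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellMapSketch table = new CellMapSketch();
        table.put("651", "info", "user_nickname", 1543139553000L,
                  "lee".getBytes(StandardCharsets.UTF_8));
        byte[] nickname = table.getLatest("651", "info", "user_nickname");
        System.out.println(new String(nickname, StandardCharsets.UTF_8));
    }
}
```

The map is sparse because a row only has entries for the columns that were actually written, and each column keeps its versions ordered by timestamp.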

Let’s see an example of what an HBase row looks like.
Suppose that a table contains statistics for tracking web page visits. The table consists of the “info” and “stats” column families, and the row key contains the ID of the user who visited the web page.

row key (user ID) | info:user_nickname | stats:https://web1.com | stats:https://web2.com | Timestamp
651               | lee                | 205                    | 600                    | 1543139553000
651               | lee                | 409                    | 1100                   | 1543139953000
442122            | dug_1995           | 10                     | 152                    | 1543139597000
442122            | dug_1995           | 170                    | 302                    | 1543149997000

More info about HBase data model can be found here.

Manipulating table data

We know that data inside HBase is organized into tables. Now we can discuss how to read, write and remove data from tables. The first set of operations is used to insert, update and remove rows and columns.

Operations related to data access:

Non-CAS mutation operations can be batched to update multiple rows in one request. A batch can mix read and write operations, and the result of each operation can be accessed separately. A batch operation doesn’t guarantee atomicity (see ACID section), i.e. some operations can succeed while others fail.

Let’s see how data operations work by example. Suppose we have the same schema as in the Data model section example. In this section we will use the official HBase Java API, but the example is not exhaustive enough to learn the Java API from. For more information, see the javadocs and the HBase client docs.

Let’s create the table “page_visits”. We have a few options to do this:

create 'page_visits', {NAME => 'info'}, {NAME => 'stats'}


Now we are ready to write some code which shows how to manipulate data in HBase. The next example has comments which should clarify each step.

// create Table instance and put data to table, suppose we already open HBase connection
var pageVisitTable = hbaseConnection.getTable(TableName.valueOf("page_visits"));
// add 2 new users to table without any page visit statistics.
// all data in HBase is stored as raw byte arrays. We use the standard HBase client class Bytes
// to convert the row key, CF, qualifier and value into byte arrays.
byte[] infoCf = Bytes.toBytes("info");
byte[] nicknameColQualifier = Bytes.toBytes("user_nickname");
byte[] user1RowKey = Bytes.toBytes("user_id-1");
byte[] user2RowKey = Bytes.toBytes("user_id-2");
Put user1 = new Put(user1RowKey).addColumn(infoCf, nicknameColQualifier, Bytes.toBytes("john_1985"));
Put user2 = new Put(user2RowKey).addColumn(infoCf, nicknameColQualifier, Bytes.toBytes("lee"));
pageVisitTable.put(List.of(user1, user2));

// now we are ready to add page visit statistics for the users
// for each user we add 1 page visit
byte[] statsCf = Bytes.toBytes("stats");
var statInc1 = new Increment(user1RowKey)
                    .addColumn(statsCf, Bytes.toBytes("http://hbase.apache.org/apidocs"), 1L);
var statInc2 = new Increment(user2RowKey)
                    .addColumn(statsCf, Bytes.toBytes("http://hbase.apache.org/book.html"), 1L);
// we can put increment one by one 
// or using batch
var batchResult = new Object[2];
pageVisitTable.batch(List.of(statInc1, statInc2), batchResult);

// now we are ready to read the current state of user statistics
var statsScan = new Scan().withStartRow(user1RowKey)
                          .addFamily(statsCf); // we need only page visit stats, not user info
// try-with-resources closes the scanner and clears its state on the server
try (ResultScanner scanner = pageVisitTable.getScanner(statsScan)) {
  for (Result res : scanner) { // requests the next portion of data when needed
    for (Cell statCell : res.listCells()) {
      // each returned cell contains the web page as column qualifier and the page visit count as value
      var webPage = Bytes.toString(CellUtil.cloneQualifier(statCell));
      // counters written by Increment are stored as 8-byte longs
      var counter = Bytes.toLong(CellUtil.cloneValue(statCell));
    }
  }
}

// user1 is no longer required, remove it together with its statistics
pageVisitTable.delete(new Delete(user1RowKey));
// try to find the removed user
Result getResult = pageVisitTable.get(new Get(user1RowKey));
if (getResult.isEmpty()) {
  System.out.println("User not found");
}

// if user2 visited page 'http://hadoop.apache.org' 1000 or more times, remove the counter
// (check the CheckAndMutateBuilder javadoc for the operand order of the comparison)
var pageCol = Bytes.toBytes("http://hadoop.apache.org");
boolean removed = pageVisitTable.checkAndMutate(user2RowKey, statsCf)
                       .qualifier(pageCol)
                       .ifMatches(CompareOperator.GREATER_OR_EQUAL, Bytes.toBytes(1000L))
                       .thenDelete(new Delete(user2RowKey).addColumn(statsCf, pageCol));

Data versioning

As noted above, each column in HBase can have multiple versions, identified by timestamp. A particular version of a column can be referenced by a cell, which is the tuple [row key, column family, column qualifier, timestamp].

Recent versions of HBase store 1 version of data by default. You can set the maximum number of versions per column family by using:

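alter 'your_table_name', NAME => 'multi_version_cf', VERSIONS => 3

var admin = hbaseConnection.getAdmin();
var cf = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("multi_version_cf"))
                                      .setMaxVersions(3).build();
var tableDesc = TableDescriptorBuilder.newBuilder(TableName.valueOf("test_table"))
                                      .setColumnFamily(cf).build();
admin.createTable(tableDesc); // for an existing table, admin.modifyColumnFamily(tableName, cf) can be used instead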


The timestamp can be set in the write request to HBase; by default, HBase sets the timestamp on the server to the current epoch time in milliseconds. The timestamp can be any positive 64-bit value and does not have to increase strictly over time.
The Get API returns only the latest cell version (the one with the greatest timestamp), which is not necessarily the last written cell. A Get request also allows configuring a timestamp range per column family. Such a Get request returns the cells which fall into this range.
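
The difference between “latest by timestamp” and “last written” is easy to miss, so here is a small sketch of a single versioned column in plain Java. It is a simplified model under stated assumptions: the class VersionedColumnSketch is hypothetical, excess versions are dropped eagerly on write (a real store may remove them later), and the range is treated as [minTs, maxTs), the same half-open convention the HBase client uses for time ranges.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one column that keeps at most maxVersions versions. */
final class VersionedColumnSketch {
    private final int maxVersions;
    // timestamp -> value, ordered from the newest timestamp to the oldest
    private final NavigableMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.reverseOrder());

    VersionedColumnSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Writes a version; the timestamp is supplied by the caller and may be out of order. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // the last entry has the smallest timestamp
        }
    }

    /** Returns the value with the greatest timestamp, not the most recently written one. */
    String getLatest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Returns values with minTs <= timestamp < maxTs, newest first. */
    List<String> getInRange(long minTs, long maxTs) {
        // in a descending map the higher timestamp is the "from" bound
        return new ArrayList<>(versions.subMap(maxTs, false, minTs, true).values());
    }

    public static void main(String[] args) {
        VersionedColumnSketch column = new VersionedColumnSketch(3);
        column.put(300L, "c");
        column.put(100L, "a"); // written later, but with an older timestamp
        System.out.println(column.getLatest());
    }
}
```

In this sketch the second write does not change what a plain read returns, because the version with timestamp 300 still has the greatest timestamp.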

The most interesting part of versioning is deletion. The HBase API allows removing cells:

More info about versions in HBase can be found in the docs.


HBase can handle stale data by removing it after some time. This feature is called Time-To-Live (TTL).
TTL can be configured on a column family or on a particular cell. There are notable differences in how TTL is handled for a cell and for a CF:

You can set the TTL value from:

create 'test_table', {NAME => 'cf1', VERSIONS => 1, TTL => 2592000}


For more info about TTL, see the docs.
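
As a rough illustration of the expiry rule, the sketch below checks whether a cell is past its TTL. It assumes, as I understand the HBase reference guide, that the column family TTL is given in seconds, the cell TTL in milliseconds, and that a cell TTL cannot extend a cell’s lifetime beyond the CF TTL. The class TtlSketch, the NO_TTL marker and the exact boundary handling are assumptions of this example, not HBase behavior to rely on.

```java
/** Toy TTL check; names and boundary handling are illustrative only. */
final class TtlSketch {
    /** Marker meaning "no TTL configured". */
    static final long NO_TTL = Long.MAX_VALUE;

    /**
     * @param cellTimestampMs cell timestamp in epoch milliseconds
     * @param nowMs           current time in epoch milliseconds
     * @param cfTtlSeconds    column family TTL in seconds, or NO_TTL
     * @param cellTtlMillis   per-cell TTL in milliseconds, or NO_TTL
     */
    static boolean isExpired(long cellTimestampMs, long nowMs,
                             long cfTtlSeconds, long cellTtlMillis) {
        long cfTtlMillis = cfTtlSeconds == NO_TTL ? NO_TTL : cfTtlSeconds * 1000L;
        // the shorter of the two TTLs wins
        long effectiveTtlMillis = Math.min(cfTtlMillis, cellTtlMillis);
        if (effectiveTtlMillis == NO_TTL) {
            return false;
        }
        return nowMs - cellTimestampMs > effectiveTtlMillis;
    }

    public static void main(String[] args) {
        long thirtyDaysSeconds = 2592000L; // the TTL value from the shell example
        long writtenAt = 0L;
        long thirtyOneDaysLaterMs = 31L * 24 * 60 * 60 * 1000;
        System.out.println(isExpired(writtenAt, thirtyOneDaysLaterMs, thirtyDaysSeconds, NO_TTL));
    }
}
```

The value 2592000 used in the shell command corresponds to 30 days expressed in seconds.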


In this section we discuss ACID semantics in general and how HBase conforms to them. Readers familiar with ACID terminology can skip the next paragraph and go directly to the HBase ACID part.
ACID is a set of guarantees provided by a database during transaction processing. It is an acronym for “Atomicity, Consistency, Isolation and Durability”. Let’s discuss each component of ACID:

HBase is not an ACID-compliant database; however, it does guarantee some specific properties. The next section contains only a brief review of HBase ACID properties, with a few examples to ease understanding. The full version can be found in the documentation.



// suppose we have 5 clients with IDs 1, 2, ..., 5. Each client puts columns "c1" and "c2"
// with a numeric value equal to its ID. All 5 clients work at the same time without any synchronization.
var row = Bytes.toBytes("r1");
var cf = Bytes.toBytes("col_family_name");
var c1 = Bytes.toBytes("c1");
var c2 = Bytes.toBytes("c2");
Put putR1Row = new Put(row).addColumn(cf, c1, Bytes.toBytes(CLIENT_ID))
                           .addColumn(cf, c2, Bytes.toBytes(CLIENT_ID));
table.put(putR1Row);


// at the same time, on some other server, a read-only client tries to get
// the values of the "c1" and "c2" columns in row "r1".
var row = Bytes.toBytes("r1");
var cf = Bytes.toBytes("col_family_name");
var c1 = Bytes.toBytes("c1");
var c2 = Bytes.toBytes("c2");
Get getR1Row = new Get(row).addColumn(cf, c1)
                           .addColumn(cf, c2);
Result getResult = table.get(getR1Row);
// this never prints results such as:
// C1=1
// C2=2 
// or 
// C1=5
// C2=3
// possible results can be: 
// C1=1 
// C2=1 
// or 
// C1=2
// C2=2 
// or 
// ......
// C1=5
// C2=5  
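// getValue returns byte[], so decode it before printing (assuming CLIENT_ID is an int)
System.out.println("C1=" + Bytes.toInt(getResult.getValue(cf, c1)));
System.out.println("C2=" + Bytes.toInt(getResult.getValue(cf, c2)));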



Part 2

The next chapter is dedicated to HBase administration topics, e.g. HBase cluster architecture, replication, data storage format, etc. It will be helpful for system administrators as well as developers who want to know how HBase works inside.

HBase architecture

We start with the components an HBase cluster has under the hood and how they interact with each other.
HBase runs on top of Apache Hadoop and Apache Zookeeper. Strictly speaking, from Hadoop it mostly requires HDFS, where it stores the data.
An Apache Zookeeper cluster is used for failure detection of HBase nodes and stores the distributed configuration of the HBase cluster (more info in the following sections).
An HBase cluster consists of one active Master, several backup Masters and many RegionServers.


The Master is responsible for the following tasks:

An HBase cluster typically runs multiple Masters, one of which is active while the others are backups. When the master instances start, each of them takes part in a leader election (using Zookeeper) to become the active master. When one instance wins the election, the others switch to the “observer” state and wait until the active master fails (which starts a new round of election).


The other component of an HBase cluster is the RegionServer. You can think of it as a “worker” node which is responsible for serving client requests and managing data regions.
Let’s talk about regions. As we already know, tables in HBase consist of rows identified by a key. Rows are sorted by key in the data structures inside HBase. A region is a group of contiguous rows, defined by the start key and end key of the rows which belong to it.
A RegionServer hosts multiple regions of different tables. It’s important to note that regions of the same table may be hosted on different servers, i.e. table data is distributed across the cluster. But each region is managed by only one RegionServer at a time (this guarantees that a row mutation is atomic, see ACID section).
When a RegionServer fails, the Master reassigns all of its regions to other RegionServers. Because all region data is stored on HDFS, the Master can safely assign a region to any live server.
Typically, a RegionServer and an HDFS DataNode are collocated on the same host. When a RegionServer saves data to HDFS, the first replica is written on the same node as the RegionServer and the other replicas on remote DataNodes. Over time, the RegionServer compacts files, and the compacted files are written to the local DataNode. From this point, the data is local to the RegionServer, which improves performance. For more information about data locality, see the docs.
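
Because regions are contiguous key ranges, finding the server for a row boils down to a “floor” lookup over the sorted region start keys. The following sketch shows that idea in plain Java; RegionLookupSketch and its methods are hypothetical names, and String ordering stands in for the byte-wise ordering HBase uses for row keys (the two agree for plain ASCII keys).

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy mapping from region start keys to the server that hosts each region. */
final class RegionLookupSketch {
    // region start key (inclusive) -> hosting server; a region ends where the next one starts,
    // and the empty start key marks the first region of the table
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    /** Assigns (or reassigns) the region that starts at startKey to a server. */
    void assign(String startKey, String server) {
        regionStartToServer.put(startKey, server);
    }

    /** Finds the server whose region covers rowKey: the greatest start key <= rowKey. */
    String serverFor(String rowKey) {
        Map.Entry<String, String> region = regionStartToServer.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.assign("", "rs1");  // rows before "g"
        table.assign("g", "rs2"); // rows from "g" up to "p"
        table.assign("p", "rs3"); // rows from "p" onwards
        System.out.println(table.serverFor("kiwi"));
        table.assign("g", "rs3"); // e.g. the Master reassigns the region after rs2 fails
        System.out.println(table.serverFor("kiwi"));
    }
}
```

Reassigning a region only changes which server the range points to; the key range itself, and therefore the data it covers, stays the same.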

Other notable RegionServer responsibilities are:

The following simplified diagram shows a typical HBase cluster and how its components interact with each other:


Client-cluster interaction

In this section we give a high-level overview of how clients connect to an HBase cluster and interact with it. The following simplified diagram shows how a client interacts with an HBase cluster:

The client requires the Zookeeper quorum connection string (which contains all Zookeeper quorum servers, e.g. “server1:port, …, serverN:port”) and the base znode used by the HBase cluster (see the zookeeper.znode.parent server property). These are used to connect to the Zookeeper quorum and read the location of the hbase:meta system table (that is, which RegionServer currently hosts it). The client then connects to this RegionServer and reads the content of hbase:meta to cache region locations. The hbase:meta table contains metadata for all regions of all tables managed by the cluster. Using the cached region metadata, the client can find the RegionServer which can handle a request for a particular row.
But this cache can become stale, for example when the Master reassigns regions between RegionServers. In this case, the client sends a request to a RegionServer which no longer serves the region, and that server responds with a “NotServingRegion” error. On receiving the “NotServingRegion” error, the client invalidates its hbase:meta cache and repeats the request against the new RegionServer.
As you can see, for typical data manipulation requests the client doesn’t interact with the Master and doesn’t depend on its availability. But administration API requests (table/CF/namespace creation, starting load balancing, region compaction, etc) require a connection to the active HBase Master.
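
The cache-and-retry behavior described above can be sketched as a small simulation. This is not the HBase client code: MetaCacheSketch, its fields and the nested exception class are stand-ins invented for this example, and the “cluster” is just an in-memory map playing the role of hbase:meta.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy simulation of a client-side region location cache with invalidation on error. */
final class MetaCacheSketch {
    /** Stand-in for the error a RegionServer returns when it no longer hosts a region. */
    static final class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String region) {
            super("not serving region " + region);
        }
    }

    /** Authoritative region -> server assignment (plays the role of hbase:meta). */
    final Map<String, String> meta = new HashMap<>();
    /** Client-side cache of region locations. */
    final Map<String, String> cache = new HashMap<>();
    /** Counts how often the client had to consult meta. */
    int metaLookups = 0;

    /** Simulates the RegionServer side: only the current owner serves the region. */
    private String callServer(String server, String region) {
        if (!server.equals(meta.get(region))) {
            throw new NotServingRegionException(region);
        }
        return "served " + region + " by " + server;
    }

    /** Sends a request using the cached location, refreshing the cache once on failure. */
    String request(String region) {
        for (int attempt = 0; attempt < 2; attempt++) {
            String server = cache.computeIfAbsent(region, r -> {
                metaLookups++;
                return meta.get(r);
            });
            if (server == null) {
                throw new IllegalStateException("unknown region " + region);
            }
            try {
                return callServer(server, region);
            } catch (NotServingRegionException e) {
                cache.remove(region); // stale entry: drop it and look up meta again
            }
        }
        throw new IllegalStateException("region " + region + " keeps moving");
    }

    public static void main(String[] args) {
        MetaCacheSketch client = new MetaCacheSketch();
        client.meta.put("r1", "rs1");
        System.out.println(client.request("r1"));
        client.meta.put("r1", "rs2"); // the region is reassigned behind the client's back
        System.out.println(client.request("r1"));
    }
}
```

Note that the happy path never touches meta after the first lookup, which mirrors why regular reads and writes do not depend on the Master.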

On-disk data representation

In this section we discuss how HBase stores data on disk.
As we already know, the RegionServer is responsible for managing data in an HBase cluster. For each column family in each region, the RegionServer creates a so-called Store. A Store consists of a MemStore and a collection of on-disk StoreFiles (HFiles).
The MemStore is an in-memory data structure implemented as a skip list. It contains cells, or KeyValues, which represent the latest data changes. All Put and Delete requests served by the RegionServer are applied to the MemStore and also written to the WAL for durability. The MemStore has a configurable maximum size, which is 128MB by default. When the MemStore reaches this limit, it is flushed to disk and the RegionServer creates a new empty MemStore.
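
The write path described above (WAL first, then a sorted in-memory structure, then a flush when a size limit is reached) can be sketched as follows. This is a toy model, not the RegionServer implementation: MemStoreSketch is a hypothetical class, sizes are rough character counts, and a “flushed file” is simply an in-memory sorted snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of a MemStore that flushes to sorted "files" when it grows too large. */
final class MemStoreSketch {
    private final long flushThresholdBytes;
    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private long activeSizeBytes = 0;

    /** Stand-in for the write-ahead log: every change is appended here first. */
    final List<String> wal = new ArrayList<>();
    /** Stand-in for on-disk HFiles: sorted snapshots of flushed MemStores. */
    final List<NavigableMap<String, String>> flushedFiles = new ArrayList<>();

    MemStoreSketch(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);  // 1. record the change for durability
        active.put(key, value);      // 2. apply it to the sorted in-memory map
        activeSizeBytes += key.length() + value.length(); // rough size estimate
        if (activeSizeBytes >= flushThresholdBytes) {
            flush();
        }
    }

    private void flush() {
        flushedFiles.add(new TreeMap<>(active)); // snapshot is already sorted by key
        active = new ConcurrentSkipListMap<>();  // start a new empty MemStore
        activeSizeBytes = 0;
    }

    int activeEntries() {
        return active.size();
    }

    public static void main(String[] args) {
        MemStoreSketch store = new MemStoreSketch(10);
        store.put("b", "1111");
        store.put("a", "2222"); // reaches the threshold and triggers a flush
        System.out.println(store.flushedFiles.get(0).firstKey());
    }
}
```

Since the skip list keeps entries sorted, the flushed snapshot is sorted as well, which is what makes the sorted on-disk layout described in the next paragraphs cheap to produce.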
HFile is a file format based on the SSTable described in the BigTable paper. An HFile consists of a sequence of blocks with a size of 64KB (a configurable value). Blocks can be of different types:

Data blocks

Each data block consists of KeyValue data structures. KeyValues inside an HFile are sorted according to the following rule: first by row, then by column family, followed by column qualifier, and finally by timestamp (sorted in reverse, so the newest records are returned first).
A KeyValue never crosses block boundaries, i.e. even if it is larger than the block size, it is written into a single block.
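
The sort rule above maps directly onto a chained comparator. The sketch below uses a simplified, hypothetical KeyValue class with String fields (real HBase keys are byte arrays compared byte by byte), so it only illustrates the ordering, not the on-disk format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of the KeyValue sort order inside an HFile. */
final class KeyValueOrderSketch {
    /** Simplified key: Strings stand in for the byte arrays HBase really uses. */
    static final class KeyValue {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        KeyValue(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    /** Row, then family, then qualifier ascending; timestamp descending (newest first). */
    static final Comparator<KeyValue> ORDER = Comparator
            .comparing((KeyValue kv) -> kv.row)
            .thenComparing((KeyValue kv) -> kv.family)
            .thenComparing((KeyValue kv) -> kv.qualifier)
            .thenComparing(Comparator.comparingLong((KeyValue kv) -> kv.timestamp).reversed());

    public static void main(String[] args) {
        List<KeyValue> cells = new ArrayList<>();
        cells.add(new KeyValue("r1", "info", "a", 100L));
        cells.add(new KeyValue("r1", "info", "a", 200L));
        cells.add(new KeyValue("r0", "stats", "z", 5L));
        cells.sort(ORDER);
        cells.forEach(System.out::println);
    }
}
```

Placing the newest timestamp first means a reader looking for the latest version of a column can stop at the first matching KeyValue.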


As you already know, an HBase row consists of many cells, which are represented as KeyValues on disk. Commonly, cells of the same row have many fields containing the same data (most frequently the row key and column family). To reduce disk usage, HBase has an option to enable data encoding/compression. For more information about which compression/encoding algorithm to choose, read the Compression and Data Block Encoding In HBase section in the official docs.

Index blocks

Index blocks inside an HFile contain the index structure. It provides a quick binary search by key to find the blocks which contain a particular row.

Bloom filter blocks

Bloom filter blocks contain chunks of a Bloom filter. A Bloom filter is a data structure designed to predict whether a given element is a member of a set of data.
When HBase executes a Get request for a row, it uses Bloom filters to detect whether the row is present in a given HFile. If not, HBase skips the entire HFile and continues with the other files. It is important to note that a Bloom filter is a probabilistic structure which can return “false positives”, i.e. it can say that a row is contained in an HFile when it actually isn’t. In that case HBase must perform additional reads of the HFile to check whether the row is really present in the file.
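
For readers who have not met Bloom filters before, here is a minimal generic sketch in plain Java. It is not HBase’s implementation: BloomFilterSketch, the bit array size and the way the hash functions are derived are all choices made for this example. What it does demonstrate is the key property: a “no” answer is always correct, while a “yes” answer may be a false positive.

```java
import java.util.BitSet;

/** Minimal Bloom filter; sizes and hashing are illustrative, not tuned. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Marks the key as present by setting one bit per hash function. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Derives the i-th bit position from two base hashes of the key. */
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-1");
        System.out.println(filter.mightContain("row-1")); // an added key is never reported absent
    }
}
```

In HBase terms, a “definitely absent” answer lets the read skip an HFile entirely, while a “possibly present” answer still requires looking into the file.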

A description of the HFile format can be found in the docs.

Region replication

HBase provides strongly consistent reads and writes. Strong consistency is achieved by the fact that each region is managed by a single RegionServer. But if a RegionServer fails, all of its regions are inaccessible for some time. This time is defined by the Zookeeper session timeout, which is 90 seconds by default (see docs). The timeout can be decreased to reduce the time to recovery (TTR), but this can lead to spurious failures caused by temporary network issues, Java GC pauses, etc, which in turn can lead to excessive region reassignment and consequently to an improper balance of regions across RegionServers.
A feature known as region replication is designed to partially overcome this limitation. By default, each region has only 1 replica. When the replication factor is increased to 2 or more, the region is assigned to several RegionServers. One of these replicas is the primary, which accepts both writes and reads for the region. The other replicas are secondary and can handle only read requests.
HBase replication is an asynchronous process, and it can take some time to propagate new writes to the secondary replicas. Because the visibility of changes can be delayed, the client has two options:

Timeline read example

Suppose that we have 3 RegionServers (RS1, RS2, RS3), 1 write-only client (CW1) and 2 read-only clients (CR1 and CR2). RS1 hosts the primary replica; RS2 and RS3 host secondary replicas (RS2 and RS3 are replication sinks). The CW1 client executes 3 write operations W1, W2, W3 one by one, with some delay between them. The CR1 client reads only from RS1, while CR2 uses timeline reads and can read from any server.
As you can see in the picture, CR1 always reads the last value, W3, written by CW1. CR2 executes 3 consecutive reads with timeline consistency:

  1. The first read gets a response from the primary replica and sees the last written value, W3.
  2. The second read goes to the RS2 server, which sees W2 as the last value.
  3. The third read gets a response from RS3, which sees W1 as the last written value.

As you can see, 3 consecutive read operations with timeline consistency can return different results over time. The application should be prepared for this behavior and able to handle stale reads.
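
The scenario above can be replayed with a tiny simulation. Everything in it is hypothetical (the class TimelineReadSketch, the explicit replicate step, the stale flag on the result); it only mirrors the idea that a secondary replica answers with whatever it has applied so far and that the client can tell such an answer may be stale.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of strong reads vs. timeline reads against lagging replicas. */
final class TimelineReadSketch {
    /** A read result together with a flag telling whether it may be stale. */
    static final class ReadResult {
        final String value;
        final boolean stale;

        ReadResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    /** Writes in the order the primary replica accepted them. */
    private final List<String> primaryLog = new ArrayList<>();
    /** How many writes each secondary replica has applied so far. */
    private final int[] appliedBySecondary;

    TimelineReadSketch(int secondaries) {
        this.appliedBySecondary = new int[secondaries];
    }

    void write(String value) {
        primaryLog.add(value);
    }

    /** Simulates asynchronous replication catching up to the first upTo writes. */
    void replicate(int secondary, int upTo) {
        appliedBySecondary[secondary] = Math.min(upTo, primaryLog.size());
    }

    /** Strong read: always answered by the primary, never stale. */
    ReadResult strongRead() {
        return new ReadResult(lastOf(primaryLog.size()), false);
    }

    /** Timeline read answered by a secondary: may lag behind the primary. */
    ReadResult timelineReadFromSecondary(int secondary) {
        return new ReadResult(lastOf(appliedBySecondary[secondary]), true);
    }

    private String lastOf(int appliedCount) {
        return appliedCount == 0 ? null : primaryLog.get(appliedCount - 1);
    }

    public static void main(String[] args) {
        TimelineReadSketch region = new TimelineReadSketch(2);
        region.write("W1");
        region.write("W2");
        region.write("W3");
        region.replicate(0, 2); // RS2 has applied W1 and W2
        region.replicate(1, 1); // RS3 has applied only W1
        System.out.println(region.strongRead().value);
        System.out.println(region.timelineReadFromSecondary(0).value);
        System.out.println(region.timelineReadFromSecondary(1).value);
    }
}
```

An application using timeline reads would typically check the stale flag and decide whether a possibly outdated value is acceptable for the request at hand.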


Replication caveats

Replication has a few caveats which you should be prepared to handle.

For more information about replication, see docs.

Part 3

This is the last chapter of this post. At this point we know how HBase works and are ready to describe some possible use cases. Each example has a domain-specific description and a detailed explanation of how we store data inside HBase.

Realtime data aggregation

Suppose we have an advertising platform which shows ads on web sites and/or in mobile apps. Advertisers start their campaigns, and the platform collects information about ad events, such as when an ad was shown (an impression event) or when a user clicked on the ad’s banner.
Suppose impression and click events are written to Apache Kafka by the service which captures them. Now we, as the team which prepares reports, need to collect all events into some aggregated form. These aggregations can be used as building blocks for time-based reports, where a user wants a report with the count of impressions and clicks for some campaign in a selected time range. The resulting data in the report should have hourly granularity, i.e. the user wants to see how the counters change each hour:

Time             | Impressions | Clicks
2018-12-15 14:00 | 12          | 5
2018-12-15 15:00 | 17          | 12
2018-12-16 12:00 | 100         | 67

This task can be solved with the help of HBase.

In fact, our report consists of two sub-reports: impression count and click count. Let’s mark each subtype according to its event type: EventType.IMPRESSION and EventType.CLICK.
Now we have a stream of events of different types which we poll from Kafka. Because we have a large number of events, we decide to run several instances of our reporting service. Each instance receives a batch of events (Kafka returns a bunch of records on each poll operation), groups them by [campaign ID, event type, timestamp] and computes the count for each group. The timestamp in each group is the original event timestamp truncated to the beginning of the hour; for instance, if ts=“2018-12-15 14:23:45”, then the truncated value is “2018-12-15 14:00:00”. This truncation makes it possible to group events which belong to the same hour.
Now we need to save the new counter values. We will use the following schema to store data in HBase:

Because we have several instances which can change a counter value concurrently, we can’t simply put the new counter value. We will use the HBase Increment operation to atomically increment the current counter value.
There is one interesting nuance in how we aggregate and write data to HBase. As you can see, we poll data from Kafka in batches and perform pre-aggregation by grouping events by [campaign ID, event type, timestamp].
The reader may propose a simpler solution which doesn’t aggregate data on the service side, but instead sends a bunch of increments by one for each [campaign ID, event type, timestamp]. This solution will work, but it has a performance impact because the increment operation uses some sort of CAS on the server side. When all service instances send batches of increment operations, HBase tries to apply increments from several client requests to the same cell. This creates contention on the server side and affects request execution time. That’s why we pre-aggregate on the service side and reduce the number of increments executed on the HBase side.
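
The pre-aggregation step can be sketched in plain Java as follows. The Event class, the EventType enum values and the “campaignId|type|hourEpochMillis” key layout are assumptions made for this illustration (the actual row key design is a separate decision); each entry of the resulting map would then be sent to HBase as a single Increment.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy pre-aggregation of an event batch into hourly counters. */
final class PreAggregationSketch {
    enum EventType { IMPRESSION, CLICK }

    /** One ad event as polled from the queue; field names are illustrative. */
    static final class Event {
        final String campaignId;
        final EventType type;
        final Instant timestamp;

        Event(String campaignId, EventType type, Instant timestamp) {
            this.campaignId = campaignId;
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    /**
     * Groups events by [campaign ID, event type, hour] and counts each group.
     * The returned key has the hypothetical form "campaignId|type|hourEpochMillis".
     */
    static Map<String, Long> aggregate(List<Event> batch) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event event : batch) {
            // truncate the timestamp to the beginning of its hour (UTC)
            long hour = event.timestamp.truncatedTo(ChronoUnit.HOURS).toEpochMilli();
            String key = event.campaignId + "|" + event.type + "|" + hour;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> batch = List.of(
                new Event("c1", EventType.IMPRESSION, Instant.parse("2018-12-15T14:23:45Z")),
                new Event("c1", EventType.IMPRESSION, Instant.parse("2018-12-15T14:59:59Z")),
                new Event("c1", EventType.CLICK, Instant.parse("2018-12-15T14:05:00Z")));
        // each entry would become one Increment instead of one increment per event
        aggregate(batch).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

With this approach the number of increments per batch is bounded by the number of distinct [campaign ID, event type, hour] groups rather than by the number of events.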

File system image storage

Suppose we want to implement a high-level document/file store for non-advanced users who will interact with it through a web interface. The first idea that comes to mind is to store file content inside HBase, because it can store raw binary data in cells. But in reality, HBase is not designed to store big BLOBs.
In the documentation we can find a section named “Storing Medium-sized Objects (MOB)” which starts with the following: “Data comes in many sizes, and saving all of your data in HBase, including binary data such as images and documents, is ideal”. Great, this can help us. But if we read the whole section, we realize that this feature focuses on BLOBs with a size between 100KB and 10MB. To simplify things, let’s suppose we can’t have files larger than 10MB in our system.
A typical file system contains a file system image (the FS hierarchy, a tree of directories and files) and the actual content of files (binary data). Using the MOB feature of HBase, we can easily store file content inside cells. Now we need to define how we will store the FS image.
The FS image, as well as the binary data of files, must be stored durably to prevent data loss. We also need fast access to list directory content. Again, HBase meets these requirements, because it provides durable storage and fast key-value access.
The FS image can be represented as a tree with nodes of different types. For simplicity, let’s suppose that we need to support only files and directories. Like any typical file system, our FS also defines a root node, which we mark as “/”. This is a special type of node, and every other FS tree node is a descendant of it.


Now we start to design how we will store the FS image in HBase.
As we defined earlier, we have 2 types of nodes, directories and files, plus 1 special root node. Each node will be represented as a row in an HBase table. The row key will contain a unique node ID (for instance, a GUID) generated when the node is created. We cannot use the node name (which is the file or directory name) because it’s not unique.
Each row in HBase will have 2 column families: one for metadata and one for the file’s content (a MOB-enabled column family). Metadata will contain:

We should define operations which will be supported by our file system:

  1. Create file or directory
  2. Read file content
  3. Write file content
  4. List content of directory

From here, we start coding and demonstrate how each operation can be implemented using the HBase client API. These examples are not written to be optimal from a performance point of view, and they don’t contain exhaustive checks to prevent all types of errors.

Create file with content

// define the ID of the root dir as a constant ROOT_ID=0
// create a file under the root directory
byte[] fileNodeId = generateNewId();
byte[] metadataCf = Bytes.toBytes("metadata");
byte[] fileDataCf = Bytes.toBytes("file_data");
byte[] parentCol = Bytes.toBytes("parent");
byte[] nodeTypeCol = Bytes.toBytes("type");
byte[] fileContentCol = Bytes.toBytes("content");
// prepare file metadata
Put newFileNodePut = new Put(fileNodeId)
                           .addColumn(metadataCf, parentCol, Bytes.toBytes(ROOT_ID)) // parent node is the root node
                           .addColumn(metadataCf, nodeTypeCol, Bytes.toBytes("FILE")) // we create a file, so set the type accordingly
                           .addColumn(/** other file metadata **/); // set name, creation ts, owner, etc

// load file content
byte[] rawFileContent = getFileContent();
Put fileContentPut = new Put(fileNodeId)
                          .addColumn(fileDataCf, fileContentCol, rawFileContent);
// apply metadata and content to the row as one atomic mutation
RowMutations fileCreateMutations = new RowMutations(fileNodeId);
fileCreateMutations.add(newFileNodePut);
fileCreateMutations.add(fileContentPut);
// use CAS to prevent overwriting an existing file with the same ID;
// one way to do it: apply the mutations only if the node type column doesn't exist yet
boolean isFileCreated = table.checkAndMutate(fileNodeId, metadataCf)
                             .qualifier(nodeTypeCol)
                             .ifNotExists()
                             .thenMutate(fileCreateMutations);
if (!isFileCreated) {
  throw new FileAlreadyExistsException();
}

List content of directory

// define the ID of the root dir as a constant ROOT_ID=0

// define an interface which represents an FS node, and 2 implementations
interface Node {
  byte[] id(); // node ID
  String name(); // node name, e.g. file or directory name
  Node parent(); // parent directory node link
  List<Node> children(); // directory children list
}

class FileNode implements Node {....}
class DirectoryNode implements Node {....}

// define a method in some class which loads a node by ID from the HBase table
Node loadNode(byte[] nodeId) throws NodeNotExistsException {
  byte[] metadataCf = Bytes.toBytes("metadata");
  byte[] fileDataCf = Bytes.toBytes("file_data");
  byte[] parentCol = Bytes.toBytes("parent");
  byte[] nodeTypeCol = Bytes.toBytes("type");
  byte[] nodeNameCol = Bytes.toBytes("name");
  byte[] nodeChildrenCol = Bytes.toBytes("children");
  byte[] fileContentCol = Bytes.toBytes("content");

  // get all node data
  Get nodeGet = new Get(nodeId);
  Result nodeMetaResult = table.get(nodeGet);
  if (nodeMetaResult.isEmpty()) {
    throw new NodeNotExistsException();
  }
  // Result.getValue requires both the column family and the qualifier
  byte[] parentNodeId = nodeMetaResult.getValue(metadataCf, parentCol);
  byte[] nodeType = nodeMetaResult.getValue(metadataCf, nodeTypeCol);
  String nodeName = Bytes.toString(nodeMetaResult.getValue(metadataCf, nodeNameCol));
  // load other metadata....

  // load parent node metadata (the root node has no parent)
  Node parent = Bytes.equals(nodeId, Bytes.toBytes(ROOT_ID)) ? null : loadNode(parentNodeId);
  if (Bytes.toString(nodeType).equals("FILE")) {
    // load file content
    byte[] fileContent = nodeMetaResult.getValue(fileDataCf, fileContentCol);
    return new FileNode(nodeId, parent, nodeName, fileContent, /** other metadata **/);
  } else {
    // load children
    byte[] children = nodeMetaResult.getValue(metadataCf, nodeChildrenCol);
    List<byte[]> childrenIdList = parseAsList(children); // can be Protobuf, JSON or any other data format
    List<Node> childrenList = loadChildren(childrenIdList);
    return new DirectoryNode(nodeId, parent, nodeName, childrenList, /** other metadata **/);
  }
}

// and now list files and directories under the requested directory
String dirToList = "/home/user1";
String[] pathParts = dirToList.split("/");
try {
  Node curNode = loadNode(Bytes.toBytes(ROOT_ID));
  for (String dirName : pathParts) {
    if (dirName.isEmpty()) {
      continue; // skip the empty segment produced by the leading "/"
    }
    byte[] nodeId = getChildIdByName(curNode, dirName);
    curNode = loadNode(nodeId);
    if (!(curNode instanceof DirectoryNode)) {
      throw new IllegalOperationException("Can't list non-directory node");
    }
  }
  return curNode.children();
} catch (NodeNotExistsException e) { ... }


HBase is a very mature open-source project with a rich feature set. It has a big community and a strong list of committers (Alibaba, Cloudera, Hortonworks, Salesforce, etc). The project consistently evolves and is extended with new functionality, such as SQL via Apache Phoenix and distributed transactions via Apache Omid/Apache Tephra.
As we have seen, HBase has many applications in different areas: storing metrics (see the OpenTSDB project), advertising data, file system metadata, and many more that we can explore.


  1. https://hbase.apache.org/book.html
  2. https://hadmin.io
  3. https://zookeeper.apache.org
  4. BigTable paper: https://ai.google/research/pubs/pub27898
  5. https://hadoop.apache.org/