Apache HBase describes itself as "the Hadoop database," which can be a bit confusing, as Hadoop is typically understood to refer to the popular MapReduce processing framework. But Hadoop is really an umbrella name for an entire ecosystem of technologies, some of which HBase uses to create a distributed, column-oriented database built on the same principles as Google's Bigtable. HBase does not use Hadoop's MapReduce capabilities directly, though it can integrate with Hadoop to serve as a source or destination for MapReduce jobs.
The hallmarks of HBase are extreme scalability, high reliability, and the schema flexibility you get from a column-oriented database. While tables and column families must be defined in advance, you can add new columns on the fly. HBase also offers strong row-level consistency, built-in versioning, and "coprocessors" that provide the equivalents of triggers and stored procedures.
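To make that schema flexibility concrete, here is a minimal sketch, not HBase code, of the idea: column families are fixed when a table is created, but column qualifiers within a family can be added per row, on the fly. All names here (the table, families, and qualifiers) are hypothetical.

```python
# Toy model of HBase's schema flexibility, not the HBase client API.
class ToyTable:
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # column families: fixed at creation
        self.rows = {}                  # row_key -> {family: {qualifier: value}}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {f: {} for f in self.families})
        row[family][qualifier] = value  # new qualifiers need no schema change

users = ToyTable("users", ["info", "stats"])
users.put(b"row-1", "info", "email", b"a@example.com")
users.put(b"row-1", "info", "nickname", b"al")   # column added on the fly
```

Writing to a family that wasn't declared at creation time fails, while any number of new columns can appear within a declared family.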
Designed to support queries of massive data sets, HBase is optimized for read performance. For writes, HBase seeks to maintain consistency. Unlike the "eventually consistent" Cassandra, HBase offers no tunable consistency levels (settings that acknowledge a write after a single node, or a quorum of nodes, has persisted it). The price of HBase's strong consistency is that writes can be slower.
HDFS — the Hadoop Distributed File System — is the Hadoop ecosystem's foundation, and it's the file system atop which HBase resides. Designed to run on commodity hardware and tolerate member node failures, HDFS works best for batch processing systems that prefer streamed access to large data sets. This seems to make it inappropriate for the random access one would expect in database systems like HBase. But HBase takes steps to compensate for HDFS's otherwise incongruous behavior.
ZooKeeper, another Hadoop technology (though no longer used by current versions of the Hadoop MapReduce engine), is a distributed communication and coordination service. ZooKeeper maintains a synchronized, in-memory data structure that can be accessed by multiple clients. The data structure is organized like a file system, though the structure's components (znodes) can be data containers as well as elements in a hierarchical tree. Imagine a file system whose files can also be directories.
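The "files that are also directories" idea can be sketched as a toy znode tree, a simplified model, not the real ZooKeeper API. Every node both holds data and can have children; the paths used below are hypothetical.

```python
# Toy model of ZooKeeper's znode hierarchy, not the ZooKeeper client API.
class ZNode:
    def __init__(self, data=b""):
        self.data = data        # every znode stores data...
        self.children = {}      # ...and can also have child znodes

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        node = self.root
        parts = path.strip("/").split("/")
        for name in parts[:-1]:
            node = node.children[name]   # parent znodes must already exist
        node.children[parts[-1]] = ZNode(data)

    def get(self, path):
        node = self.root
        for name in path.strip("/").split("/"):
            node = node.children[name]
        return node.data

tree = ZNodeTree()
tree.create("/cluster", b"my-cluster")    # this znode holds data...
tree.create("/cluster/node1", b"alive")   # ...and acts as a parent, too
```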
HBase uses ZooKeeper to coordinate cluster activities and monitor the health of member nodes. When you run an HBase cluster, you must also run ZooKeeper in parallel. HBase will run and manage ZooKeeper by default, though you can configure HBase to use a separately managed ZooKeeper setup. You can even run the ZooKeeper server processes on the same hardware as the other HBase processes, but that's not recommended, particularly for a high-volume HBase cluster.
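As a sketch of the externally managed setup: you tell HBase not to start its own ZooKeeper by setting `HBASE_MANAGES_ZK=false` in conf/hbase-env.sh, then point hbase-site.xml at the standalone ensemble. The property names below are HBase's own; the hostnames are placeholders.

```xml
<!-- hbase-site.xml: use a separately managed ZooKeeper ensemble.
     Hostnames are hypothetical placeholders. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
```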
How HBase works
The HBase data model will seem familiar at first. A table consists of rows, and each row (fundamentally, just a blob of bytes) is uniquely identified by a row key. The design of the row key is important because HBase uses row keys to guide data sharding, that is, the way in which data is distributed throughout the cluster. Row keys also determine the sort order of a table's rows: rows are stored in the lexicographic order of their key bytes.
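The sorting and sharding behavior can be sketched as follows. This is an illustrative model, not HBase internals: rows are kept in the lexicographic byte order of their keys, and contiguous key ranges (HBase calls them regions) can then be spread across the cluster. The row keys and split points below are hypothetical.

```python
# Toy model of row-key ordering and region assignment, not HBase internals.
import bisect

# HBase keeps rows sorted by the byte order of their row keys.
row_keys = sorted([b"user-042", b"user-007", b"user-911", b"user-500"])

# Split points carve the key space into half-open ranges, like regions.
split_points = [b"user-100", b"user-600"]

def region_for(row_key):
    """Index of the key range (region) that contains row_key."""
    return bisect.bisect_right(split_points, row_key)

# region 0: [start, user-100); region 1: [user-100, user-600);
# region 2: [user-600, end)
```

Because assignment follows key order, row-key design directly controls which rows end up adjacent and which servers receive the write load, which is why sequential keys (such as timestamps) can hotspot a single region.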