Tag Archives: Apache HBase

[repost ]Caching in Apache HBase: SlabCache


This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.


The amount of memory available on a commodity server has increased drastically in step with Moore's law. Today, it's feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is good for databases such as HBase, which rely on in-memory caching to boost read performance.

However, despite the availability of high-memory servers, the garbage collection algorithms available in production-quality JDKs have not caught up. Attempting to use large amounts of heap will result in occasional stop-the-world pauses long enough to cause stalled requests and timeouts, noticeably disrupting latency-sensitive user applications.

Garbage Collection

The following is a quick summary of an immensely complex topic; if you would like a more detailed explanation of garbage collection, check out this post.

HBase, along with the rest of the Apache Hadoop ecosystem, is built in Java. This gives us access to an incredibly well-optimized virtual machine and an excellent mostly-concurrent garbage collector in the form of Concurrent Mark-Sweep (CMS). However, large heaps remain a weakness: CMS collects garbage without moving it, so free space can end up spread throughout the heap instead of in a large contiguous chunk. Given enough time, fragmentation will force a full, stop-the-world garbage collection with a copying collector capable of relocating objects. This results in a potentially long pause, and acts as a practical limit on the size of our heap.

Garbage collectors which do not require massive stop-the-world compactions do exist, but are not presently suitable for use with HBase. The Garbage-First (G1) collector included in recent versions of the JVM is one promising example, but early testing indicates that it still exhibits some flaws. JVMs from other (non-Oracle) vendors offer low-pause concurrent garbage collectors, but they are not in widespread use in the HBase and Hadoop communities.

The Status Quo

Currently, in order to utilize all available memory, we allocate a smaller heap and let the OS use the rest. The memory isn't wasted in this case – it's used by the filesystem cache. While this does give a noticeable performance improvement, it has its drawbacks. Data in the FS cache is still treated as file data, requiring us to checksum and verify it on each read. We also have no guarantees about what the filesystem cache will do, and only limited control over its eviction policy. While the Linux FS cache is nominally an LRU cache, other processes or jobs running on the system may flush our cache, adversely impacting performance. The FS cache is better than letting the memory go to waste, but it's neither the most efficient nor the most consistent solution.

Enter SlabCache

Another option is to manually manage the cache within Java via slab allocation, avoiding garbage collection altogether. This is the approach I implemented in HBASE-4027.

SlabCache operates by allocating a large quantity of contiguous memory, and then performing slab allocation within that block of memory. Buffers of the likely sizes of cached objects are allocated in advance; on caching, each object is fit into the smallest buffer that can contain it. Effectively, any fragmentation issues are internalized by the cache, trading off some space in order to avoid external fragmentation. As blocks generally converge around a single size, this method can still be quite space-efficient.
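To make the "smallest buffer that fits" idea concrete, here is a toy sketch. The class name and slab sizes are invented for illustration; the real SlabCache derives its slab sizes from configuration.

```java
import java.nio.ByteBuffer;

// Toy illustration of slab allocation (not the real SlabCache code):
// pre-define a few buffer size classes, and place each cached block into
// the smallest class that fits, accepting some internal fragmentation.
public class SlabSketch {
    // hypothetical size classes, in bytes
    static final int[] SLAB_SIZES = {8 * 1024, 64 * 1024, 128 * 1024};

    static int pickSlab(int blockLen) {
        for (int size : SLAB_SIZES) {
            if (blockLen <= size) return size; // smallest class that fits
        }
        return -1; // too large for any slab; not cacheable
    }

    public static void main(String[] args) {
        int block = 10 * 1024;             // a 10 KB HFile block
        int slab = pickSlab(block);        // lands in the 64 KB class
        ByteBuffer buf = ByteBuffer.allocate(slab);
        System.out.println(slab + " " + buf.capacity());
    }
}
```

A 10 KB block lands in the 64 KB class, wasting 54 KB internally; since HFile blocks cluster around the configured block size, that waste stays bounded in practice.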


While slab allocation itself does not create fragmentation, other parts of HBase still can. With slab allocation, the frequency of stop-the-world (STW) pauses may be reduced, but the worst-case maximum pause time isn't: the JVM can still decide to move our entire slab around if we happen to be really unlucky, again contributing to significant pauses. In order to prevent this, SlabCache allocates its memory using direct ByteBuffers.

Direct ByteBuffers, available in the java.nio package, are allocated outside of the normal Java heap, just like using malloc() in a C program. The garbage collector will not move memory allocated in this fashion, guaranteeing that a direct ByteBuffer will never contribute to the maximum garbage collection pause. The ByteBuffer "wrapper" is still an ordinary heap object; when it is collected, the off-heap memory is released back to the system with free.
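A minimal, self-contained illustration of allocating off-heap memory from plain Java (java.nio only, nothing HBase-specific):

```java
import java.nio.ByteBuffer;

// Direct buffers are backed by memory outside the GC-managed heap;
// the collector will never relocate their backing storage.
public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // off-heap, like malloc(1024)
        direct.putInt(42);       // write through to native memory
        direct.flip();           // switch from writing to reading
        System.out.println(direct.isDirect()); // true
        System.out.println(direct.getInt());   // 42
    }
}
```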

Reads are performed using a copy-on-read approach: every time HBase reads from SlabCache, the data is copied out of the cache and onto the heap. While passing by reference would have been the more performant solution, it would have required some way of carefully tracking references to these objects. I decided against reference counting, as it opens up the potential for an entirely new class of bugs, making continued work on HBase more difficult. Solutions involving finalizers or reference queues were also discarded, as neither guarantees timely operation. In the future, I may revisit reference counting if necessary to increase read speed.

SlabCache operates as an L2 cache, replacing the FS cache in this role. The on-heap cache is maintained as the L1 cache. This solution lets us use large amounts of memory with a substantial speed and consistency improvement over the status quo, while at the same time ameliorating the downsides of the copy-on-read approach: because the vast majority of hits come from the on-heap L1 cache, we do a minimum of copying data and creating new objects.


SlabCache operates at around 3-4x the performance of the file system cache, and also provides more consistent performance.

Performance comparisons of the three caches follow. In each test, the cache under test was configured as the primary (L1), and only, cache of HBase. YCSB was then run against HBase trunk.

HBase in all cases was running in standalone mode, compiled against the 0.20-append branch. As HDFS has gotten faster since the last release, I've also provided tests with RawLocalFS in order to isolate the difference between accessing the slab cache and accessing the FS cache, removing HDFS from the equation. In this mode, CRC is turned off and the local filesystem (ext3) is used directly. Even under these optimal conditions, SlabCache still nets a considerable performance gain.

If you’d like to try out this code in trunk, simply set MaxDirectMemorySize in hbase-env.sh. This will automatically configure the cache to use 95% of MaxDirectMemorySize and set reasonable defaults for the slab allocator. If finer control is desired, you are free to change the SlabCache settings in hbase-site.xml, which will give you finer control over off-heap memory usage and slab allocation sizing.
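For example, in hbase-env.sh this might look like the following (the 10g value is purely illustrative; size it to your hardware and verify the flag against your release):

```sh
# hbase-env.sh — give the JVM an off-heap budget for SlabCache to carve up.
# SlabCache will configure itself to use 95% of this value.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=10g"
```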


If you’re running into read performance walls with HBase, and have extra memory to spare, then please give this feature a try! This is due to be released in HBase 0.92 as an experimental feature, and will hopefully enable more efficient usage of memory.

I had an amazing summer working on this project, and as an intern, I’m awed to see this feature work and be released publicly. If you found this post interesting, and would like to work on problems like this, check out the careers page.

[repost ]Chapter 11. Apache HBase (TM) Performance Tuning



Table of Contents

11.1. Operating System
11.1.1. Memory
11.1.2. 64-bit
11.1.3. Swapping
11.2. Network
11.2.1. Single Switch
11.2.2. Multiple Switches
11.2.3. Multiple Racks
11.2.4. Network Interfaces
11.3. Java
11.3.1. The Garbage Collector and Apache HBase
11.4. HBase Configurations
11.4.1. Number of Regions
11.4.2. Managing Compactions
11.4.3. hbase.regionserver.handler.count
11.4.4. hfile.block.cache.size
11.4.5. hbase.regionserver.global.memstore.upperLimit
11.4.6. hbase.regionserver.global.memstore.lowerLimit
11.4.7. hbase.hstore.blockingStoreFiles
11.4.8. hbase.hregion.memstore.block.multiplier
11.4.9. hbase.regionserver.checksum.verify
11.5. HDFS Configuration
11.5.1. Leveraging local data
11.6. ZooKeeper
11.7. Schema Design
11.7.1. Number of Column Families
11.7.2. Key and Attribute Lengths
11.7.3. Table RegionSize
11.7.4. Bloom Filters
11.7.5. ColumnFamily BlockSize
11.7.6. In-Memory ColumnFamilies
11.7.7. Compression
11.8. Writing to HBase
11.8.1. Batch Loading
11.8.2. Table Creation: Pre-Creating Regions
11.8.3. Table Creation: Deferred Log Flush
11.8.4. HBase Client: AutoFlush
11.8.5. HBase Client: Turn off WAL on Puts
11.8.6. HBase Client: Group Puts by RegionServer
11.8.7. MapReduce: Skip The Reducer
11.8.8. Anti-Pattern: One Hot Region
11.9. Reading from HBase
11.9.1. Scan Caching
11.9.2. Scan Attribute Selection
11.9.3. MapReduce – Input Splits
11.9.4. Close ResultScanners
11.9.5. Block Cache
11.9.6. Optimal Loading of Row Keys
11.9.7. Concurrency: Monitor Data Spread
11.9.8. Bloom Filters
11.10. Deleting from HBase
11.10.1. Using HBase Tables as Queues
11.10.2. Delete RPC Behavior
11.11. HDFS
11.11.1. Current Issues With Low-Latency Reads
11.11.2. Performance Comparisons of HBase vs. HDFS
11.12. Amazon EC2
11.13. Case Studies

11.1. Operating System

11.1.1. Memory

RAM, RAM, RAM. Don’t starve HBase.

11.1.2. 64-bit

Use a 64-bit platform (and 64-bit JVM).

11.1.3. Swapping

Watch out for swapping. Set swappiness to 0.
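On Linux, this is typically done via sysctl; a minimal sketch:

```sh
# apply immediately (requires root):
sysctl -w vm.swappiness=0

# persist across reboots by adding this line to /etc/sysctl.conf:
# vm.swappiness = 0
```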

11.2. Network

Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase performance is the switching hardware that is used. Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more).

Important items to consider:

  • Switching capacity of the device
  • Number of systems connected
  • Uplink capacity


11.2.1. Single Switch

The single most important factor in this configuration is that the switching capacity of the hardware is capable of handling the traffic which can be generated by all systems connected to the switch. Some lower priced commodity hardware can have a slower switching capacity than could be utilized by a full switch.

11.2.2. Multiple Switches

Multiple switches are a potential pitfall in the architecture. The most common configuration of lower-priced hardware is a simple 1Gbps uplink from one switch to another. This often-overlooked pinch point can easily become a bottleneck for cluster communication. Especially with MapReduce jobs that are both reading and writing a lot of data, the communication across this uplink can become saturated.

Mitigation of this issue is fairly simple and can be accomplished in multiple ways:

  • Use appropriate hardware for the scale of the cluster which you’re attempting to build.
  • Use larger single-switch configurations, i.e., a single 48-port switch as opposed to two 24-port switches.
  • Configure port trunking for uplinks to utilize multiple interfaces to increase cross switch bandwidth.


11.2.3. Multiple Racks

Multiple rack configurations carry the same potential issues as multiple switches, and can suffer performance degradation from two main areas:

  • Poor switch capacity performance
  • Insufficient uplink to another rack

If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue is homing more of your cluster across racks. The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks. The downside of this method, however, is the overhead of ports that could otherwise be used. For example, creating an 8Gbps port channel from rack A to rack B uses 8 of your 24 ports to communicate between racks, giving you a poor ROI; using too few, however, can mean you’re not getting the most out of your cluster.

Using 10GbE links between racks will greatly increase performance, and assuming your switches support a 10GbE uplink or allow for an expansion card, it will allow you to save your ports for machines rather than uplinks.

11.2.4. Network Interfaces

Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in Section 13.3.1, “Case Study #1 (Performance Issue On A Single Node)”.

11.3. Java

11.3.1. The Garbage Collector and Apache HBase

Long GC pauses

In his presentation, Avoiding Full GCs with MemStore-Local Allocation Buffers, Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading: CMS failure modes and old-generation heap fragmentation. To address the first, start the CMS earlier than default by adding -XX:CMSInitiatingOccupancyFraction and setting it down from the default. Start at 60 or 70 percent (the lower you bring the threshold, the more GCing is done and the more CPU is used). To address the second, fragmentation, Todd added an experimental facility, MemStore-Local Allocation Buffers (MSLAB), that must be explicitly enabled in Apache HBase 0.90.x (it defaults to on in Apache HBase 0.92.x). Set hbase.hregion.memstore.mslab.enabled to true in your Configuration. See the cited slides for background and detail[27].
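Putting the two suggestions together, a hedged sketch of what this can look like (the occupancy value is an illustrative starting point, not a recommendation):

```sh
# hbase-env.sh — start CMS earlier than its default occupancy threshold
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
```

```xml
<!-- hbase-site.xml — enable MSLAB (only needs to be explicit on 0.90.x) -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
```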

For more information about GC logs, see Section 12.2.3, “JVM Garbage Collection Logs”.

11.4. HBase Configurations

See Section 2.5.2, “Recommended Configurations”.

11.4.1. Number of Regions

The number of regions for an HBase table is driven by the region size; see the Section, “Bigger Regions”. Also, see the architecture section on Section 9.7.1, “Region Size”.

11.4.2. Managing Compactions

For larger systems, managing compactions and splits may be something you want to consider.

11.4.3. hbase.regionserver.handler.count

See hbase.regionserver.handler.count.

11.4.4. hfile.block.cache.size

See hfile.block.cache.size. A memory setting for the RegionServer process.

11.4.5. hbase.regionserver.global.memstore.upperLimit

See hbase.regionserver.global.memstore.upperLimit. This memory setting is often adjusted for the RegionServer process depending on needs.

11.4.6. hbase.regionserver.global.memstore.lowerLimit

See hbase.regionserver.global.memstore.lowerLimit. This memory setting is often adjusted for the RegionServer process depending on needs.

11.4.7. hbase.hstore.blockingStoreFiles

See hbase.hstore.blockingStoreFiles. If there is blocking in the RegionServer logs, increasing this can help.

11.4.8. hbase.hregion.memstore.block.multiplier

See hbase.hregion.memstore.block.multiplier. If there is enough RAM, increasing this can help.

11.4.9. hbase.regionserver.checksum.verify

Have HBase write the checksum into the data block, saving a separate checksum seek whenever you read. See the release note on HBASE-5074 support checksums in HBase block cache.

11.5. HDFS Configuration

11.5.1. Leveraging local data

Since Hadoop 1.0.0 (also 0.22.1, 0.23.1, CDH3u3 and HDP 1.0) via HDFS-2246, it is possible for the DFSClient to take a “short circuit” and read directly from disk instead of going through the DataNode when the data is local. What this means for HBase is that the RegionServers can read directly off their machine’s disks instead of having to open a socket to talk to the DataNode, the former being generally much faster[28]. Also see HBase, mail # dev – read short circuit thread for more discussion around short circuit reads.

To enable “short circuit” reads, you must set two configurations. First, hdfs-site.xml needs to be amended: set the property dfs.block.local-path-access.user to the only user that can use the shortcut. This has to be the user that started HBase. Then, in hbase-site.xml, set dfs.client.read.shortcircuit to true.
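A sketch of the two settings (the hbase user name is an assumption; use whatever user actually starts HBase on your cluster):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <!-- assumption: HBase is started by the "hbase" user -->
  <value>hbase</value>
</property>

<!-- hbase-site.xml -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
```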

The DataNodes need to be restarted in order to pick up the new configuration. Be aware that if a process started under a username other than the one configured here also has shortcircuit enabled, it will get an Exception regarding unauthorized access, but the data will still be read.

11.6. ZooKeeper

See Chapter 16, ZooKeeper for information on configuring ZooKeeper, and see the part about having a dedicated disk.

11.7. Schema Design

11.7.1. Number of Column Families

See Section 6.2, “ On the number of column families ”.

11.7.2. Key and Attribute Lengths

See Section 6.3.2, “Try to minimize row and column sizes”. See also Section, “However…” for compression caveats.

11.7.3. Table RegionSize

The regionsize can be set on a per-table basis via setFileSize on HTableDescriptor in the event where certain tables require different regionsizes than the configured default regionsize.

See Section 11.4.1, “Number of Regions” for more information.

11.7.4. Bloom Filters

Bloom Filters can be enabled per-ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per Column Family. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.

See HColumnDescriptor and Section 11.9.8, “Bloom Filters” for more information or this answer up in quora, How are bloom filters used in HBase?.

11.7.5. ColumnFamily BlockSize

The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).

See HColumnDescriptor and Section 9.7.5, “Store”for more information.

11.7.6. In-Memory ColumnFamilies

ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the Section 9.6.4, “Block Cache”, but it is not a guarantee that the entire table will be in memory.

See HColumnDescriptor for more information.

11.7.7. Compression

Production systems should use compression with their ColumnFamily definitions. See Appendix C, Compression In HBase for more information. However…

Compression deflates data on disk. When it’s in memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and client), it’s inflated. So while using ColumnFamily compression is a best practice, it’s not going to completely eliminate the impact of over-sized keys, over-sized ColumnFamily names, or over-sized column names.

See Section 6.3.2, “Try to minimize row and column sizes” for schema design tips, and Section, “KeyValue” for more information on how HBase stores data internally.

11.8. Writing to HBase

11.8.1. Batch Loading

Use the bulk load tool if you can. See Section 9.8, “Bulk Loading”. Otherwise, pay attention to the below.

11.8.2.  Table Creation: Pre-Creating Regions

Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too many regions can actually degrade performance. An example of pre-creation using hex keys is as follows (note: this example may need to be tweaked for the individual application’s keys):


public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable(table, splits);
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions-1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}
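As a quick sanity check, the split-key math can be exercised on its own; this standalone copy of the getHexSplits logic needs only java.math, no HBase on the classpath:

```java
import java.math.BigInteger;

// Standalone copy of the hex-split logic, for experimenting with key ranges.
public class HexSplits {
    public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        BigInteger range = highestKey.subtract(lowestKey);
        BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
        lowestKey = lowestKey.add(regionIncrement);
        for (int i = 0; i < numRegions - 1; i++) {
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            splits[i] = String.format("%016x", key).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4 regions over the full 32-bit hex keyspace -> 3 split keys
        for (byte[] split : getHexSplits("00000000", "ffffffff", 4)) {
            System.out.println(new String(split));
        }
    }
}
```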


11.8.3.  Table Creation: Deferred Log Flush

The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous HLog writes, but the potential downside is that if the RegionServer goes down, the yet-to-be-flushed edits are lost. This is safer, however, than not using the WAL at all with Puts.

Deferred log flush can be configured on tables via HTableDescriptor. The default value of hbase.regionserver.optionallogflushinterval is 1000ms.
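For reference, the flush interval is an ordinary site-level setting; a sketch showing the documented default:

```xml
<!-- hbase-site.xml — how long deferred WAL edits may sit in memory (default 1000 ms) -->
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>1000</value>
</property>
```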

11.8.4. HBase Client: AutoFlush

When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. Puts added via htable.add(Put) and htable.add(List&lt;Put&gt;) wind up in the same write buffer. If autoFlush = false, these messages are not sent until the write buffer is filled. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will also invoke flushCommits.

11.8.5. HBase Client: Turn off WAL on Puts

A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the RegionServer will not write the Put to the Write Ahead Log, only into the memstore, HOWEVER the consequence is that if there is a RegionServer failure there will be data loss. If writeToWAL(false) is used, do so with extreme caution. You may find in actuality that it makes little difference if your load is well distributed across the cluster.

In general, it is best to use WAL for Puts, and where loading throughput is a concern to use bulk loading techniques instead.

11.8.6. HBase Client: Group Puts by RegionServer

In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.

11.8.7. MapReduce: Skip The Reducer

When writing a lot of data to an HBase table from a MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It’s far more efficient to just write directly to HBase.

For summary jobs where HBase is used as a source and a sink, writes will be coming from the Reducer step (e.g., summarize values, then write out the result). This is a different processing problem than the above case.

11.8.8. Anti-Pattern: One Hot Region

If all your data is being written to one region at a time, then re-read the section on processing timeseries data.

Also, if you are pre-splitting regions and all your data is still winding up in a single region even though your keys aren’t monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a variety of reasons that regions may appear “well split” but won’t work with your data. As the HBase client communicates directly with the RegionServers, this can be obtained via HTable.getRegionLocation.

See Section 11.8.2, “ Table Creation: Pre-Creating Regions ”, as well as Section 11.4, “HBase Configurations”

11.9. Reading from HBase

11.9.1. Scan Caching

If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching set to something greater than the default (which is 1). Using the default value means that the map task will call back to the RegionServer for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to having the cache value be large, because it costs more memory for both client and RegionServer, so bigger isn’t always better.

Scan Caching in MapReduce Jobs

Scan settings in MapReduce jobs deserve special attention. Timeouts can result (e.g., UnknownScannerException) in map tasks if it takes too long to process a batch of records before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.

Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.

11.9.2. Scan Attribute Selection

Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected. If scan.addFamily is called then all of the attributes in the specified ColumnFamily will be returned to the client. If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.

11.9.3. MapReduce – Input Splits

For MapReduce jobs that use HBase tables as a source, if there is a pattern where the “slow” map tasks seem to have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in Section 13.3.1, “Case Study #1 (Performance Issue On A Single Node)”.

11.9.4. Close ResultScanners

This isn’t so much about improving performance as avoiding performance problems. If you forget to close ResultScanners, you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in a try/finally block:

Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}

11.9.5. Block Cache

Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.

11.9.6. Optimal Loading of Row Keys

When performing a table scan where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a MUST_PASS_ALL operator to the scanner using setFilter. The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter. Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk and minimal network traffic to the client for a single row.

11.9.7. Concurrency: Monitor Data Spread

When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions then the reads could likely be served from too few nodes.

See Section 11.8.2, “ Table Creation: Pre-Creating Regions ”, as well as Section 11.4, “HBase Configurations”

11.9.8. Bloom Filters

Enabling Bloom filters can save you from having to go to disk and can help improve read latencies.

Bloom filters were developed over in HBase-1200 Add bloomfilters.[29][30]

See also Section 11.7.4, “Bloom Filters”.

Bloom StoreFile footprint

Bloom filters add an entry to the StoreFile general FileInfo data structure and then two extra entries to the StoreFile metadata section.

BloomFilter in the StoreFile FileInfo data structure

FileInfo has a BLOOM_FILTER_TYPE entry which is set to NONE, ROW or ROWCOL.

BloomFilter entries in StoreFile metadata

BLOOM_FILTER_META holds the bloom size, the hash function used, etc. It’s small and is cached on StoreFile.Reader load.

BLOOM_FILTER_DATA is the actual bloom filter data. It is obtained on demand and stored in the LRU cache, if that is enabled (it is enabled by default).

Bloom Filter Configuration

io.hfile.bloom.enabled global kill switch

io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.

io.hfile.bloom.error.rate

io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decreasing the rate by ½ (e.g., to .5%) costs +1 bit per bloom entry.

io.hfile.bloom.max.fold

io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, meaning the bloom can collapse to at least 1/128th of its original size. See the Development Process section of the document BloomFilters in HBase for more on what this option means.

11.10. Deleting from HBase

11.10.1. Using HBase Tables as Queues

HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in this manner. As is documented in Chapter 5, Data Model, marking rows as deleted creates additional StoreFiles which then need to be processed on reads. Tombstones only get cleaned up with major compactions.

See also Section, “Compaction” and HBaseAdmin.majorCompact.

11.10.2. Delete RPC Behavior

Be aware that htable.delete(Delete) doesn’t use the writeBuffer; it executes a RegionServer RPC on each invocation. For a large number of deletes, consider htable.delete(List).

See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29

11.11. HDFS

Because HBase runs on HDFS (see Section 9.9, “HDFS”), it is important to understand how it works and how it affects HBase.

11.11.1. Current Issues With Low-Latency Reads

The original use case for HDFS was batch processing, so low-latency reads were historically not a priority. With the increased adoption of Apache HBase this is changing, and several improvements are already in development. See the umbrella Jira ticket for HDFS improvements for HBase.

11.11.2. Performance Comparisons of HBase vs. HDFS

A fairly common question on the dist-list is why HBase isn’t as performant as HDFS files in a batch context (e.g., as a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context. Not that there isn’t room for improvement (and this gap will, over time, be reduced), but HDFS will always be faster in this use-case.

11.12. Amazon EC2

Performance questions are common on Amazon EC2 environments because EC2 is a shared environment. You will not see the same throughput as on a dedicated server. When running tests on EC2, run them several times, for the same reason (it’s a shared environment and you don’t know what else is happening on the server).

If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front, because EC2 issues are practically a separate class of performance issues.

11.13. Case Studies

For Performance and Troubleshooting Case Studies, see Chapter 13, Apache HBase (TM) Case Studies.

[27] The latest JVMs do better with regard to fragmentation, so make sure you are running a recent release. Read down in the message Identifying concurrent mode failures caused by fragmentation.

[28] See JD’s Performance Talk

[29] For description of the development process — why static blooms rather than dynamic — and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.

[30] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.

[repost ]Apache HBase on Amazon EMR – Real-time Access to Your Big Data


All Your Base
AWS has already given you a lot of storage and processing options to choose from, and today we are adding a really important one.

You can now use Apache HBase to store and process extremely large amounts of data (think billions of rows and millions of columns per row) on AWS. HBase offers a number of powerful features including:

  • Strictly consistent reads and writes.
  • High write throughput.
  • Automatic sharding of tables.
  • Efficient storage of sparse data.
  • Low-latency data access via in-memory operations.
  • Direct input and output to Hadoop jobs.
  • Integration with Apache Hive for SQL-like queries over HBase tables, joins, and JDBC support.

HBase is formally part of the Apache Hadoop project, and runs within Amazon Elastic MapReduce. You can launch HBase jobs (version 0.92.0) from the command line or the AWS Management Console.

HBase in Action
HBase has been optimized for low-latency lookups and range scans, with efficient updates and deletions of individual records. Here are some of the things that you can do with it:

Reference Data for Hadoop Analytics – Because HBase is integrated into Hadoop and Hive and provides rapid access to stored data, it is a great way to store reference data that will be used by one or more Hadoop jobs on a single cluster or across multiple Hadoop clusters.

Log Ingestion and Batch Analytics – HBase can handle real-time ingestion of log data with ease, thanks to its high write throughput and efficient storage of sparse data. Combine this with Hadoop’s ability to handle sequential reads and scans in a highly optimized fashion, and you have a powerful tool for log analysis.

Storage for High Frequency Counters and Summary Data – HBase supports high update rates (the classic read-modify-write) along with strictly consistent reads and writes. These features make it ideal for storing counters and summary data. Complex aggregations such as max-min, sum, average, and group-by can be run as Hadoop jobs and the results can be piped back into an HBase table.
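The read-modify-write path mentioned above is exposed through the incrementColumnValue API. A hedged sketch (0.92-era client API; the table handle and the row, family, and qualifier names are illustrative):

```java
// Atomically add 1 to a per-page view counter and get the new total back.
// HBase performs the read-modify-write server-side, under the row lock.
long views = htable.incrementColumnValue(
    Bytes.toBytes("page#/index.html"),  // row key
    Bytes.toBytes("stats"),             // column family
    Bytes.toBytes("views"),             // qualifier
    1L);                                // delta to add
```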

I should point out that HBase on EMR runs in a single Availability Zone and does not guarantee data durability; data stored in an HBase cluster can be lost if the master node in the cluster fails. Hence, HBase should be used for summarization or secondary data or you should make use of the backup feature described below.

You can do all of this (and a lot more) by running HBase on AWS. You’ll get all sorts of benefits when you do so:

Freedom from Drudgery – You can focus on your business and on your customers. You don’t have to set up, manage, or tune your HBase clusters. Elastic MapReduce will handle provisioning of EC2 instances, security settings, HBase configuration, log collection, health monitoring, and replacement of faulty instances. You can even expand the size of your HBase cluster with a single API call.

Backup and Recovery – You can schedule full and incremental backups of your HBase data to Amazon S3. You can rollback to an old backup on an existing cluster or you can restore a backup to a newly launched cluster.

Seamless AWS Integration – HBase on Elastic MapReduce was designed to work smoothly and seamlessly with other AWS services such as S3, DynamoDB, EC2, and CloudWatch.

Getting Started
You can start HBase from the command line by launching your Elastic MapReduce cluster with the --hbase flag:

$ elastic-mapreduce --create --hbase --name "Jeff's HBase Cluster" --num-instances 2 --instance-type m1.large

You can also start it from the Create New Cluster page of the AWS Management Console:

When you create your HBase Job Flow from the console you can restore from an existing backup, and you can also schedule future backups:

Beyond the Basics
Here are a couple of advanced features and options that might be of interest to you:

You can modify your HBase configuration at launch time by using an EMR bootstrap action. For example, you can alter the maximum file size (hbase.hregion.max.filesize) or the maximum size of the memstore (hbase.regionserver.global.memstore.upperLimit).

You can monitor your cluster with the standard CloudWatch metrics that are generated for all Elastic MapReduce job flows. You can also install Ganglia at startup time by invoking a pair of predefined bootstrap actions (install-ganglia and configure-hbase-for-ganglia). We plan to add additional metrics, specific to HBase, over time.

You can run Apache Hive on the same cluster, or you can install it on a separate cluster. Hive will run queries transparently against HBase and Hive tables. We do advise you to proceed with care when running both on the same cluster; HBase is CPU and memory intensive, while most other MapReduce jobs are I/O bound, with fixed memory requirements and sporadic CPU usage.

HBase job flows are always launched with EC2 Termination Protection enabled. You will need to confirm your intent to terminate the job flow.

I hope you enjoy this powerful new feature!

— Jeff;

PS – There is no extra charge to run HBase. You pay the usual rates for Elastic MapReduce and EC2.

[repost ]Apache HBase (TM) and ACID


About this Document

Apache HBase (TM) is not an ACID compliant database. However, it does guarantee certain specific properties.

This specification enumerates the ACID properties of HBase.


Definitions

For the sake of common vocabulary, we define the following terms:

Atomicity: an operation is atomic if it either completes entirely or not at all.
Consistency: all actions cause the table to transition from one valid state directly to another (e.g., a row will not disappear during an update).
Isolation: an operation is isolated if it appears to complete independently of any other concurrent transaction.
Durability: any update that reports “successful” to the client will not be lost.
Visibility: an update is considered visible if any subsequent read will see the update as having been committed.

The terms must and may are used as specified by RFC 2119. In short, the word “must” implies that, if some case exists where the statement is not true, it is a bug. The word “may” implies that, even if the guarantee is provided in a current release, users should not rely on it.

APIs to consider

  • Read APIs
    • get
    • scan
  • Write APIs
    • put
    • batch put
    • delete
  • Combination (read-modify-write) APIs
    • incrementColumnValue
    • checkAndPut

Guarantees Provided


Atomicity

  1. All mutations are atomic within a row. Any put will either wholly succeed or wholly fail.[3]
    1. An operation that returns a “success” code has completely succeeded.
    2. An operation that returns a “failure” code has completely failed.
    3. An operation that times out may have succeeded and may have failed. However, it will not have partially succeeded or failed.
  2. This is true even if the mutation crosses multiple column families within a row.
  3. APIs that mutate several rows will _not_ be atomic across the multiple rows. For example, a multiput that operates on rows ‘a’, ‘b’, and ‘c’ may return having mutated some but not all of the rows. In such cases, these APIs will return a list of success codes, each of which may indicate success, failure, or timeout as described above.
  4. The checkAndPut API happens atomically like the typical compareAndSet (CAS) operation found in many hardware architectures.
  5. The order of mutations is seen to happen in a well-defined order for each row, with no interleaving. For example, if one writer issues the mutation “a=1,b=1,c=1” and another writer issues the mutation “a=2,b=2,c=2”, the row must either be “a=1,b=1,c=1” or “a=2,b=2,c=2” and must not be something like “a=1,b=2,c=1”.
    1. Please note that this is not true _across rows_ for multirow batch mutations.
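The compare-and-set behavior behind checkAndPut (guarantee 4 above) can be illustrated with plain Java, using java.util.concurrent rather than the HBase client; the class and method names here are purely illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasDemo {
    // Mirrors the check-then-write step of checkAndPut: the new value is
    // written only if the cell still holds the expected old value.
    static boolean casPut(AtomicReference<String> cell,
                          String expected, String newValue) {
        return cell.compareAndSet(expected, newValue);
    }

    public static void main(String[] args) {
        AtomicReference<String> cell = new AtomicReference<String>("v1");
        System.out.println(casPut(cell, "v1", "v2")); // true: expectation held
        System.out.println(casPut(cell, "v1", "v3")); // false: cell is now "v2"
        System.out.println(cell.get());               // v2
    }
}
```

Just as with checkAndPut, a writer whose expectation has gone stale learns about it from the return value and can re-read and retry.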

Consistency and Isolation

  1. All rows returned via any access API will consist of a complete row that existed at some point in the table’s history.
  2. This is true across column families – i.e., a get of a full row that occurs concurrently with some mutations 1,2,3,4,5 will return a complete row that existed at some point in time between mutation i and i+1 for some i between 1 and 5.
  3. The state of a row will only move forward through the history of edits to it.

Consistency of Scans

A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.

Rather, scans have the following properties:

  1. Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time) [1]
  2. A scan will always reflect a view of the data at least as new as the beginning of the scan. This satisfies the visibility guarantees enumerated below.
    1. For example, if client A writes data X and then communicates via a side channel to client B, any scans started by client B will contain data at least as new as X.
    2. A scan _must_ reflect all mutations committed prior to the construction of the scanner, and _may_ reflect some mutations committed subsequent to the construction of the scanner.
    3. Scans must include all data written prior to the scan (except in the case where data is subsequently mutated, in which case it _may_ reflect the mutation)

Those familiar with relational databases will recognize this isolation level as “read committed”.

Please note that the guarantees listed above regarding scanner consistency are referring to “transaction commit time”, not the “timestamp” field of each cell. That is to say, a scanner started at time t may see edits with a timestamp value greater than t, if those edits were committed with a “forward dated” timestamp before the scanner was constructed.


Visibility

  1. When a client receives a “success” response for any mutation, that mutation is immediately visible to both that client and any client with whom it later communicates through side channels. [3]
  2. A row must never exhibit so-called “time-travel” properties. That is to say, if a series of mutations moves a row sequentially through a series of states, any sequence of concurrent reads will return a subsequence of those states.
    1. For example, if a row’s cells are mutated using the “incrementColumnValue” API, a client must never see the value of any cell decrease.
    2. This is true regardless of which read API is used to read back the mutation.
  3. Any version of a cell that has been returned to a read operation is guaranteed to be durably stored.


Durability

  1. All visible data is also durable data. That is to say, a read will never return data that has not been made durable on disk.[2]
  2. Any operation that returns a “success” code (e.g., does not throw an exception) will be made durable.[3]
  3. Any operation that returns a “failure” code will not be made durable (subject to the Atomicity guarantees above).
  4. All reasonable failure scenarios will not affect any of the guarantees of this document.


Tunability

All of the above guarantees must be possible within Apache HBase. For users who would like to trade off some guarantees for performance, HBase may offer several tuning options. For example:

  • Visibility may be tuned on a per-read basis to allow stale reads or time travel.
  • Durability may be tuned to only flush data to disk on a periodic basis

More Information

For more information, see the client architecture or data model sections in the Apache HBase Reference Guide.


[1] A consistent view is not guaranteed for intra-row scanning — i.e. fetching a portion of a row in one RPC then going back to fetch another portion of the row in a subsequent RPC. Intra-row scanning happens when you set a limit on how many values to return per Scan#next (see Scan#setBatch(int)).

[2] In the context of Apache HBase, “durably on disk” implies an hflush() call on the transaction log. This does not actually imply an fsync() to magnetic media, but rather just that the data has been written to the OS cache on all replicas of the log. In the case of a full datacenter power loss, it is possible that the edits are not truly durable.

[3] Puts will either wholly succeed or wholly fail, provided that they are actually sent to the RegionServer. If the writebuffer is used, Puts will not be sent until the writebuffer is filled or it is explicitly flushed.

[repost ]The Apache HBase™ Reference Guide


Revision History
Revision 0.95-SNAPSHOT 2012-11-02T14:06


This is the official reference guide of Apache HBase (TM), a distributed, versioned, column-oriented database built on top of Apache Hadoop and Apache ZooKeeper.

Table of Contents

1. Getting Started
1.1. Introduction
1.2. Quick Start
2. Apache HBase (TM) Configuration
2.1. Basic Prerequisites
2.2. HBase run modes: Standalone and Distributed
2.3. Configuration Files
2.4. Example Configurations
2.5. The Important Configurations
3. Upgrading
3.1. Upgrading to HBase 0.90.x from 0.20.x or 0.89.x
3.2. Upgrading from 0.90.x to 0.92.x
4. The Apache HBase Shell
4.1. Scripting
4.2. Shell Tricks
5. Data Model
5.1. Conceptual View
5.2. Physical View
5.3. Table
5.4. Row
5.5. Column Family
5.6. Cells
5.7. Data Model Operations
5.8. Versions
5.9. Sort Order
5.10. Column Metadata
5.11. Joins
6. HBase and Schema Design
6.1. Schema Creation
6.2. On the number of column families
6.3. Rowkey Design
6.4. Number of Versions
6.5. Supported Datatypes
6.6. Joins
6.7. Time To Live (TTL)
6.8. Keeping Deleted Cells
6.9. Secondary Indexes and Alternate Query Paths
6.10. Schema Design Smackdown
6.11. Operational and Performance Configuration Options
6.12. Constraints
7. HBase and MapReduce
7.1. Map-Task Splitting
7.2. HBase MapReduce Examples
7.3. Accessing Other HBase Tables in a MapReduce Job
7.4. Speculative Execution
8. Secure Apache HBase (TM)
8.1. Secure Client Access to Apache HBase
8.2. Access Control
9. Architecture
9.1. Overview
9.2. Catalog Tables
9.3. Client
9.4. Client Request Filters
9.5. Master
9.6. RegionServer
9.7. Regions
9.8. Bulk Loading
9.9. HDFS
10. Apache HBase (TM) External APIs
10.1. Non-Java Languages Talking to the JVM
10.2. REST
10.3. Thrift
10.4. C/C++ Apache HBase Client
11. Apache HBase (TM) Performance Tuning
11.1. Operating System
11.2. Network
11.3. Java
11.4. HBase Configurations
11.5. HDFS Configuration
11.6. ZooKeeper
11.7. Schema Design
11.8. Writing to HBase
11.9. Reading from HBase
11.10. Deleting from HBase
11.11. HDFS
11.12. Amazon EC2
11.13. Case Studies
12. Troubleshooting and Debugging Apache HBase (TM)
12.1. General Guidelines
12.2. Logs
12.3. Resources
12.4. Tools
12.5. Client
12.6. MapReduce
12.7. NameNode
12.8. Network
12.9. RegionServer
12.10. Master
12.11. ZooKeeper
12.12. Amazon EC2
12.13. HBase and Hadoop version issues
12.14. Case Studies
13. Apache HBase (TM) Case Studies
13.1. Overview
13.2. Schema Design
13.3. Performance/Troubleshooting
14. Apache HBase (TM) Operational Management
14.1. HBase Tools and Utilities
14.2. Region Management
14.3. Node Management
14.4. HBase Metrics
14.5. HBase Monitoring
14.6. Cluster Replication
14.7. HBase Backup
14.8. Capacity Planning
15. Building and Developing Apache HBase (TM)
15.1. Apache HBase Repositories
15.2. IDEs
15.3. Building Apache HBase
15.4. Adding an Apache HBase release to Apache’s Maven Repository
15.5. Updating hbase.apache.org
15.6. Tests
15.7. Maven Build Commands
15.8. Getting Involved
15.9. Developing
15.10. Submitting Patches
16. ZooKeeper
16.1. Using existing ZooKeeper ensemble
16.2. SASL Authentication with ZooKeeper
17. Community
17.1. Decisions
17.2. Community Roles
B. hbck In Depth
B.1. Running hbck to identify inconsistencies
B.2. Inconsistencies
B.3. Localized repairs
B.4. Region Overlap Repairs
C. Compression In HBase
C.1. CompressionTest Tool
C.2. hbase.regionserver.codecs
C.3. LZO
C.6. Changing Compression Schemes
D. YCSB: The Yahoo! Cloud Serving Benchmark and HBase
E. HFile format version 2
E.1. Motivation
E.2. HFile format version 1 overview
E.3. HBase file format with inline blocks (version 2)
F. Other Information About HBase
F.1. HBase Videos
F.2. HBase Presentations (Slides)
F.3. HBase Papers
F.4. HBase Sites
F.5. HBase Books
F.6. Hadoop Books
G. HBase History
H. HBase and the Apache Software Foundation
H.1. ASF Development Process
H.2. ASF Board Reporting
I. Enabling Dapper-like Tracing in HBase
I.1. SpanReceivers
I.2. Client Modifications

[repost ]intricacies of running Apache HBase on Windows using Cygwin



Apache HBase (TM) is a distributed, column-oriented store, modeled after Google’s BigTable. Apache HBase is built on top of Hadoop for its MapReduce and distributed file system implementation. All these projects are open-source and part of the Apache Software Foundation.

As being distributed, large scale platforms, the Hadoop and HBase projects mainly focus on *nix environments for production installations. However, being developed in Java, both projects are fully portable across platforms and, hence, also to the Windows operating system. For ease of development the projects rely on Cygwin to have a *nix-like environment on Windows to run the shell scripts.


This document explains the intricacies of running Apache HBase on Windows using Cygwin as an all-in-one single-node installation for testing and development. The HBase Overview and QuickStart guides, on the other hand, go a long way in explaining how to set up HBase in more complex deployment scenarios.


For running Apache HBase on Windows, 3 technologies are required: Java, Cygwin and SSH. The following paragraphs detail the installation of each of the aforementioned technologies.


HBase depends on the Java Platform, Standard Edition, 6 Release. So the target system has to be provided with at least the Java Runtime Environment (JRE); however, if the system will also be used for development, the Java Development Kit (JDK) is preferred. You can download the latest versions of both from Sun’s download page. Installation is a simple GUI wizard that guides you through the process.


Cygwin is probably the oddest technology in this solution stack. It provides a dynamic link library that emulates most of a *nix environment on Windows. On top of that, a whole bunch of the most common *nix tools are supplied. Combined, the DLL and the tools form a very *nix-alike environment on Windows.

For installation, Cygwin provides the setup.exe utility that tracks the versions of all installed components on the target system and provides the mechanism for installing or updating everything from the mirror sites of Cygwin.

To support installation, the setup.exe utility uses 2 directories on the target system: the Root directory for Cygwin (defaults to C:\cygwin), which will become / within the eventual Cygwin installation; and the Local Package directory (e.g., C:\cygsetup), which is the cache where setup.exe stores the packages before they are installed. The cache must not be the same folder as the Cygwin root.

Perform the following steps to install Cygwin, which are elaborately detailed in the 2nd chapter of the Cygwin User’s Guide:

  1. Make sure you have Administrator privileges on the target system.
  2. Choose and create your Root and Local Package directories. A good suggestion is to use the C:\cygwin\root and C:\cygwin\setup folders.
  3. Download the setup.exe utility and save it to the Local Package directory.
  4. Run the setup.exe utility:
    1. Choose the Install from Internet option,
    2. Choose your Root and Local Package folders
    3. and select an appropriate mirror.
    4. Don’t select any additional packages yet, as we only want to install Cygwin for now.
    5. Wait for download and install
    6. Finish the installation
  5. Optionally, you can now also add a shortcut to your Start menu pointing to the setup.exe utility in the Local Package folder.
  6. Add CYGWIN_HOME system-wide environment variable that points to your Root directory.
  7. Add %CYGWIN_HOME%\bin to the end of your PATH environment variable.
  8. Reboot the system after making changes to the environment variables; otherwise the OS will not be able to find the Cygwin utilities.
  9. Test your installation by running your freshly created shortcuts or the Cygwin.bat command in the Root folder. You should end up in a terminal window that is running a Bash shell. Test the shell by issuing the following commands:
    1. cd / should take you to the Root directory in Cygwin;
    2. the ls command should list all files and folders in the current directory.
    3. Use the exit command to end the terminal.
  10. When needed, to uninstall Cygwin you can simply delete the Root and Local Package directory, and the shortcuts that were created during installation.


HBase (and Hadoop) rely on SSH for interprocess/-node communication and launching remote commands. SSH will be provisioned on the target system via Cygwin, which supports running Cygwin programs as Windows services!

  1. Rerun the setup.exe utility.
  2. Leave all parameters as is, skipping through the wizard using the Next button until the Select Packages panel is shown.
  3. Maximize the window and click the View button to toggle to the list view, which is ordered alphabetically by Package, making it easier to find the packages we’ll need.
  4. Select the following packages by clicking the status word (normally Skip) so it’s marked for installation. Use the Next button to download and install the packages.
    1. OpenSSH
    2. tcp_wrappers
    3. diffutils
    4. zlib
  5. Wait for the install to complete and finish the installation.


Download the latest release of Apache HBase from the website. As the Apache HBase distributable is just a zipped archive, installation is as simple as unpacking the archive so it ends up in its final installation directory. Notice that HBase has to be installed in Cygwin and a good directory suggestion is to use /usr/local/ (or [Root directory]\usr\local in Windows slang). You should end up with a /usr/local/hbase-<version> installation in Cygwin.

This finishes installation. We go on with the configuration.


There are 3 parts left to configure: Java, SSH and HBase itself. The following paragraphs explain each topic in detail.


One important thing to remember in shell scripting in general (i.e., *nix and Windows) is that managing, manipulating and assembling path names that contain spaces can be very hard, due to the need to escape and quote those characters and strings. So we try to stay away from spaces in path names. *nix environments can help us out here very easily by using symbolic links.

  1. Create a link in /usr/local to the Java home directory by using the following command and substituting the name of your chosen Java environment:
    ln -s /cygdrive/c/Program\ Files/Java/<jre name> /usr/local/<jre name>
  2. Test your Java installation by changing directories to your Java folder with cd /usr/local/<jre name> and issuing the command ./bin/java -version. This should output the version of your chosen JRE.

Configuring SSH is quite elaborate, but primarily a question of launching it by default as a Windows service.

  1. On Windows Vista and above, make sure you run the Cygwin shell with elevated privileges by right-clicking on the shortcut and using Run as Administrator.
  2. First of all, we have to make sure the rights on some crucial files are correct. Use the commands below. You can verify all rights by using the ls -l command on the different files. Also notice that the auto-completion feature in the shell using <TAB> is extremely handy in these situations.
    1. chmod +r /etc/passwd to make the passwords file readable for all
    2. chmod u+w /etc/passwd to make the passwords file writable for the owner
    3. chmod +r /etc/group to make the groups file readable for all
    4. chmod u+w /etc/group to make the groups file writable for the owner
    5. chmod 755 /var to make the var folder writable to owner and readable and executable to all
  3. Edit the /etc/hosts.allow file using your favorite editor (why not vi in the shell!) and make sure the following two lines are in there before the PARANOID line:
    1. ALL : localhost : allow
    2. ALL : [::1]/128 : allow
  4. Next we have to configure SSH by using the script ssh-host-config
    1. If this script asks to overwrite an existing /etc/ssh_config, answer yes.
    2. If this script asks to overwrite an existing /etc/sshd_config, answer yes.
    3. If this script asks to use privilege separation, answer yes.
    4. If this script asks to install sshd as a service, answer yes. Make sure you started your shell as Administrator!
    5. If this script asks for the CYGWIN value, just <enter> as the default is ntsec.
    6. If this script asks to create the sshd account, answer yes.
    7. If this script asks to use a different user name as service account, answer no as the default will suffice.
    8. If this script asks to create the cyg_server account, answer yes. Enter a password for the account.
  5. Start the SSH service using net start sshd or cygrunsrv --start sshd. Notice that cygrunsrv is the utility that makes the process run as a Windows service. Confirm that you see a message stating that the CYGWIN sshd service was started successfully.
  6. Harmonize Windows and Cygwin user accounts by using the commands:
    1. mkpasswd -cl > /etc/passwd
    2. mkgroup --local > /etc/group
  7. Test the installation of SSH:
    1. Open a new Cygwin terminal
    2. Use the command whoami to verify your userID
    3. Issue an ssh localhost to connect to the system itself
      1. Answer yes when presented with the server’s fingerprint
      2. Issue your password when prompted
      3. Test a few commands in the remote session
      4. The exit command should take you back to your first shell in Cygwin
    4. exit should terminate the Cygwin shell.


If all previous configurations are working properly, we just need some tinkering at the HBase config files to make them resolve properly on Windows/Cygwin. All files and paths referenced here start from the HBase [installation directory] as working directory.

  1. HBase uses ./conf/hbase-env.sh to configure its dependencies on the runtime environment. Copy and uncomment the following lines just underneath their originals, changing them to fit your environment. They should read something like:
    1. export JAVA_HOME=/usr/local/<jre name>
    2. export HBASE_IDENT_STRING=$HOSTNAME as this most likely does not include spaces.
  2. HBase uses the ./conf/hbase-default.xml file for configuration. Some properties do not resolve to existing directories because the JVM runs on Windows. This is the major issue to keep in mind when working with Cygwin: within the shell all paths are *nix-alike, hence relative to the root /. However, every parameter that is to be consumed by the Windows processes themselves needs to be a Windows setting, hence C:\-alike. Change the following properties in the configuration file, adjusting paths where necessary to conform with your own installation:
    1. hbase.rootdir must read e.g. file:///C:/cygwin/root/tmp/hbase/data
    2. hbase.tmp.dir must read C:/cygwin/root/tmp/hbase/tmp
    3. hbase.zookeeper.quorum must read because for some reason localhost doesn’t seem to resolve properly on Cygwin.
  3. Make sure the configured hbase.rootdir and hbase.tmp.dir directories exist and have the proper rights set up e.g. by issuing a chmod 777 on them.
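Pulled together, the path-related edits amount to a fragment of ./conf/hbase-default.xml along these lines (adjust the paths to your own Root directory):

```xml
<property>
  <name>hbase.rootdir</name>
  <value>file:///C:/cygwin/root/tmp/hbase/data</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>C:/cygwin/root/tmp/hbase/tmp</value>
</property>
```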

This should conclude the installation and configuration of Apache HBase on Windows using Cygwin. So it’s time to test it.

  1. Start a Cygwin terminal, if you haven’t already.
  2. Change directory to the HBase installation using cd /usr/local/hbase-<version>, preferably using auto-completion.
  3. Start HBase using the command ./bin/start-hbase.sh
    1. When prompted to accept the SSH fingerprint, answer yes.
    2. When prompted, provide your password. Maybe multiple times.
    3. When the command completes, the HBase server should have started.
    4. However, to be absolutely certain, check the logs in the ./logs directory for any exceptions.
  4. Next we start the HBase shell using the command ./bin/hbase shell
  5. We run some simple test commands
    1. Create a simple table using command create 'test', 'data'
    2. Verify the table exists using the command list
    3. Insert data into the table using e.g.
      put 'test', 'row1', 'data:1', 'value1'
      put 'test', 'row2', 'data:2', 'value2'
      put 'test', 'row3', 'data:3', 'value3'
    4. List all rows in the table using the command scan 'test', which should list all the rows previously inserted. Notice how 3 new columns were added without changing the schema!
    5. Finally we get rid of the table by issuing disable 'test' followed by drop 'test' and verified by list which should give an empty listing.
  6. Leave the shell by exit
  7. To stop the HBase server issue the ./bin/stop-hbase.sh command. And wait for it to complete!!! Killing the process might corrupt your data on disk.
  8. In case of problems,
    1. verify the HBase logs in the ./logs directory.
    2. Try to fix the problem
    3. Get help on the forums or IRC (#hbase@freenode.net). People are very active and keen to help out!
    4. Stop, restart and retest the server.


Now that your HBase server is running, start coding and build that next killer app on this particular, but scalable, datastore!

[repost ]Apache Hbase Coprocessor Introduction


Authors: Trend Micro Hadoop Group: Mingjie Lai, Eugene Koontz, Andrew Purtell

(The original version of the blog was posted at http://hbaseblog.com/2010/11/30/hbase-coprocessors/ in late 2010, however the site is no longer available. Since we decided to move all blog posts to the official Apache blog, here I just recover the original post and make quite some modifications to reflect the latest coprocessor changes in HBase 0.92.)

HBase has very effective MapReduce integration for distributed computation over data stored within its tables, but in many cases – for example simple additive or aggregating operations like summing, counting, and the like – pushing the computation up to the server where it can operate on the data directly without communication overheads can give a dramatic performance improvement over HBase’s already good scanning performance.

Also, before 0.92, it was not possible to extend HBase with custom functionality except by extending the base classes. Due to Java’s lack of multiple inheritance this required extension plus base code to be refactored into a single class providing the full implementation, which quickly becomes brittle when considering multiple extensions. Who inherits from whom? Coprocessors allow a much more flexible mixin extension model.

In this article I will introduce the new Coprocessors feature of HBase, a framework for both flexible and generic extension, and of distributed computation directly within the HBase server processes. I will talk about what it is, how it works, and how to develop coprocessor extensions.

The idea of HBase Coprocessors was inspired by Google’s BigTable coprocessors. Jeff Dean gave a talk at LADIS ’09 (http://www.scribd.com/doc/21631448/Dean-Keynote-Ladis2009, page 66-67), and mentioned some basic ideas of coprocessors, which Google developed to bring computing parallelism to BigTable. They have the following characteristics:

  • Arbitrary code can run at each tablet in the tablet server
  • High-level call interface for clients
    • Calls are addressed to rows or ranges of rows and the coprocessor client library resolves them to actual locations;
    • Calls across multiple rows are automatically split into multiple parallelized RPCs
  • Provides a very flexible model for building distributed services
  • Automatic scaling, load balancing, request routing for applications

Back to HBase, we definitely want to support efficient computational parallelism as well, beyond what Hadoop MapReduce can provide. In addition, exciting new features can be built on top of it, for example secondary indexing, complex filtering (push down predicates), and access control. HBase coprocessors are inspired by BigTable coprocessors but are divergent in implementation detail. What we have built is a framework that provides a library and runtime environment for executing user code within the HBase region server and master processes. Google coprocessors in contrast run colocated with the tablet server but outside of its address space. (HBase is also considering an option for deployment of coprocessor code external to the server process. See https://issues.apache.org/jira/browse/HBASE-4047 .)

Coprocessors can be loaded globally on all tables and regions hosted by a region server; these are known as system coprocessors. Alternatively, the administrator can specify, on a per-table basis, which coprocessors should be loaded on all regions of a table; these are known as table coprocessors.

In order to support sufficient flexibility for potential coprocessor behaviors, two different aspects of extension are provided by the framework. One is the observer, which is like a trigger in a conventional database, and the other is the endpoint, a dynamic RPC endpoint that resembles a stored procedure.


The idea behind observers is that we can insert user code by overriding upcall methods provided by the coprocessor framework. The callback functions are executed from core HBase code when certain events occur. The coprocessor framework handles all of the details of invoking callbacks during various base HBase activities; the coprocessor need only insert the desired additional or alternate functionality.

Currently in HBase 0.92, we have three observer interfaces provided:

  • RegionObserver: Provides hooks for data manipulation events, Get, Put, Delete, Scan, and so on. There is an instance of a RegionObserver coprocessor for every table region and the scope of the observations they can make is constrained to that region.
  • WALObserver: Provides hooks for write-ahead log (WAL) related operations. This is a way to observe or intercept WAL writing and reconstruction events. A WALObserver runs in the context of WAL processing. There is one such context per region server.
  • MasterObserver: Provides hooks for DDL-type operations, i.e., create, delete, modify table, etc. The MasterObserver runs within the context of the HBase master.

More than one observer can be loaded at one place — region, master, or WAL — to extend functionality. They are chained to execute sequentially by order of assigned priorities. There is nothing preventing a coprocessor implementor from communicating internally between the contexts of his or her installed observers, providing comprehensive coverage of HBase functions.

A coprocessor at a higher priority may preempt action by those at lower priority by throwing an IOException (or a subclass of this). We will use this preemptive functionality below in an example of an Access Control coprocessor.

As we mentioned above, various events cause observer methods to be invoked on loaded observers. The set of events and method signatures are presented in the HBase API, beginning with HBase version 0.92. Please be aware that the API may change in the future, due to HBase client API changes or other reasons. We tried to stabilize the API before the 0.92 release, but there is no guarantee.

The RegionObserver interface provides callbacks for:

  • preOpen, postOpen: Called before and after the region is reported as online to the master.
  • preFlush, postFlush: Called before and after the memstore is flushed into a new store file.
  • preGet, postGet: Called before and after a client makes a Get request.
  • preExists, postExists: Called before and after the client tests for existence using a Get.
  • prePut and postPut: Called before and after the client stores a value.
  • preDelete and postDelete: Called before and after the client deletes a value.
  • etc.

Please refer to HBase 0.92 javadoc to get the whole list of declared methods.

We provide a convenient abstract class, BaseRegionObserver, which implements all RegionObserver methods with default behaviors, so you can focus on the events you have interest in, without having to process upcalls for all of them.
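The effect of BaseRegionObserver can be sketched in plain Java (the classes below are illustrative stand-ins, not HBase's actual API): an abstract base supplies no-op defaults for every hook, and a concrete observer overrides only the event it cares about.

```java
// Illustrative stand-in for BaseRegionObserver: every hook has a no-op default.
abstract class SketchRegionObserver {
    public void preGet(String row, StringBuilder log)  { }
    public void postGet(String row, StringBuilder log) { }
    public void prePut(String row, StringBuilder log)  { }
    public void postPut(String row, StringBuilder log) { }
}

public class ObserverSketch {
    // An observer interested only in postPut events overrides just that hook.
    static class PutLogger extends SketchRegionObserver {
        @Override
        public void postPut(String row, StringBuilder log) {
            log.append("put:").append(row).append(';');
        }
    }

    public static void main(String[] args) {
        SketchRegionObserver obs = new PutLogger();
        StringBuilder log = new StringBuilder();
        // The "framework" invokes every hook; only the overridden one acts.
        obs.preGet("r1", log);
        obs.prePut("r1", log);
        obs.postPut("r1", log);
        System.out.println(log);  // put:r1;
    }
}
```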

Here is a sequence diagram that shows how RegionObservers work with other HBase components:

[sequence diagram: regionobserver.png]

Below is an example of an extension that hooks into HBase function using the RegionObserver interface. It is a demonstration of the addition of simple access control. This coprocessor checks user information for a given client request by injecting code at certain RegionObserver preXXX hooks. If the user is not allowed to access the resource, an AccessDeniedException will be thrown. As we mentioned above, exceptions by high-priority coprocessors (such as an access control coprocessor) can be used to preempt further action. In this case, the AccessDeniedException means that the client’s request will not be processed and the client will receive the exception as indication of what happened.

package org.apache.hadoop.hbase.coprocessor;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;

// Sample access-control coprocessor. It utilizes RegionObserver
// and intercepts preXXX() methods to check user privilege for the given table
// and column family.
public class AccessControlCoprocessor extends BaseRegionObserver {
  @Override
  public void preGet(final ObserverContext<RegionCoprocessorEnvironment> c,
      final Get get, final List<KeyValue> result) throws IOException {
    // Check permissions...
    if (!permissionGranted()) {
      throw new AccessDeniedException("User is not allowed to access.");
    }
  }

  // Override prePut(), preDelete(), etc.
}

The MasterObserver interface provides upcalls for:

  • preCreateTable/postCreateTable: Called before and after a table is created.
  • preDeleteTable/postDeleteTable
  • etc.

WALObserver provides upcalls for:

  • preWALWrite/postWALWrite: Called before and after a WALEdit is written to the WAL.

Please refer to the HBase javadoc for the whole list of observer interface declarations.


As mentioned previously, observers can be thought of like database triggers. Endpoints, on the other hand, are more powerful, resembling stored procedures. One can invoke an endpoint at any time from the client. The endpoint implementation will then be executed remotely at the target region or regions, and results from those executions will be returned to the client.

Endpoint is an interface for dynamic RPC extension. The endpoint implementation is installed on the server side and can then be invoked with HBase RPC. The client library provides convenience methods for invoking such dynamic interfaces.

Also as mentioned above, there is nothing stopping the implementation of an endpoint from communicating with any observer implementation. With these extension surfaces combined you can add whole new features to HBase without modifying or recompiling HBase itself. This can be very powerful.

In order to build and use your own endpoint, you need to:

  • Have a new protocol interface which extends CoprocessorProtocol.
  • Implement the Endpoint interface. The implementation will be loaded into and executed from the region context.
  • Extend the abstract class BaseEndpointCoprocessor. This convenience class hides some internal details that the implementer need not be concerned about, such as coprocessor framework class loading.
  • On the client side, the Endpoint can be invoked by two new HBase client APIs:
    • Executing against a single region:
      HTableInterface.coprocessorProxy(Class<T> protocol, byte[] row) 
    • Executing over a range of regions
      HTableInterface.coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Batch.Call<T,R> callable) 

Here is an example that shows how an endpoint works.

In this example, the endpoint will scan a given column in a region and aggregate values, expected to be serialized Long values, then return the result to client. The client collects the partial aggregates returned from remote endpoint invocation over individual regions, and sums the results to get the final answer over the full table. Note that the HBase client has the responsibility for dispatching parallel endpoint invocations to the target regions, and for collecting the returned results to present to the application code. This is like a lightweight MapReduce job: The “map” is the endpoint execution performed in the region server on every target region, and the “reduce” is the final aggregation at the client. Meanwhile, the coprocessor framework on the server side and in the client library is like the MapReduce framework, moving tedious distributed systems programming details behind a clean API, so the programmer can focus on the application.
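The "lightweight MapReduce" flow described above can be sketched in plain Java with no HBase dependency (the region layout and values here are made up for illustration): each simulated region computes a partial sum (the "map"), and the client folds the partials together (the "reduce").

```java
import java.util.HashMap;
import java.util.Map;

public class EndpointAggregationSketch {
    // Stand-in for the per-region endpoint: sums the values held by one region.
    static long regionSum(long[] regionValues) {
        long sum = 0;
        for (long v : regionValues) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Three simulated regions, each holding part of the column's values.
        Map<String, long[]> regions = new HashMap<>();
        regions.put("region-1", new long[] {1, 2, 3});
        regions.put("region-2", new long[] {10, 20});
        regions.put("region-3", new long[] {100});

        // "Map": an endpoint invocation on every region returns a partial result.
        Map<String, Long> partials = new HashMap<>();
        for (Map.Entry<String, long[]> e : regions.entrySet()) {
            partials.put(e.getKey(), regionSum(e.getValue()));
        }

        // "Reduce": the client sums the partial results into the final answer.
        long total = 0;
        for (long partial : partials.values()) total += partial;
        System.out.println(total);  // 136
    }
}
```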

Note also that HBase, and all of Hadoop, currently requires Java 6, which has a verbose syntax for anonymous classes. As HBase (and Hadoop) evolves with the introduction of Java 7 language features, we expect the verbosity of client endpoint code can be reduced substantially.

// A sample protocol for performing aggregation at regions.
public static interface ColumnAggregationProtocol
    extends CoprocessorProtocol {
  // Perform aggregation for a given column at the region. The aggregation
  // will include all the rows inside the region. It can be extended to
  // allow passing start and end rows for a fine-grained aggregation.
  public long sum(byte[] family, byte[] qualifier) throws IOException;
}

// Aggregation implementation at a region.
public static class ColumnAggregationEndpoint extends BaseEndpointCoprocessor
    implements ColumnAggregationProtocol {
  @Override
  public long sum(byte[] family, byte[] qualifier) throws IOException {
    // aggregate at each region
    Scan scan = new Scan();
    scan.addColumn(family, qualifier);
    long sumResult = 0;
    InternalScanner scanner = getEnvironment().getRegion().getScanner(scan);
    try {
      List<KeyValue> curVals = new ArrayList<KeyValue>();
      boolean hasMore = false;
      do {
        curVals.clear();
        hasMore = scanner.next(curVals);
        KeyValue kv = curVals.get(0);
        sumResult += Bytes.toLong(kv.getValue());
      } while (hasMore);
    } finally {
      scanner.close();
    }
    return sumResult;
  }
}

Client invocations are performed through new methods on HTableInterface:

public <T extends CoprocessorProtocol> T coprocessorProxy(
    Class<T> protocol, Row row);

public <T extends CoprocessorProtocol, R> void coprocessorExec(
    Class<T> protocol, List<? extends Row> rows,
    BatchCall<T,R> callable, BatchCallback<R> callback);

public <T extends CoprocessorProtocol, R> void coprocessorExec(
    Class<T> protocol, RowRange range,
    BatchCall<T,R> callable, BatchCallback<R> callback);

Here is the client side example of calling ColumnAggregationEndpoint:

HTableInterface table = new HTable(util.getConfiguration(), TEST_TABLE);
Scan scan;
Map<byte[], Long> results;

// scan: for all regions
scan = new Scan();
results = table.coprocessorExec(ColumnAggregationProtocol.class, scan,
    new BatchCall<ColumnAggregationProtocol, Long>() {
      public Long call(ColumnAggregationProtocol instance) throws IOException {
        return instance.sum(TEST_FAMILY, TEST_QUALIFIER);
      }
    });

long sumResult = 0;
long expectedResult = 0;
for (Map.Entry<byte[], Long> e : results.entrySet()) {
  sumResult += e.getValue();
}

The above example is actually a simplified version of HBASE-1512. You can refer to that JIRA or the HBase source code of org.apache.hadoop.hbase.coprocessor.AggregateImplementation for more detail.

Below is a visualization of dynamic RPC invocation for this example. The client-side application code performs a batch call. This initiates parallel RPC invocations of the registered dynamic protocol on every target table region. The results of those invocations are returned as they become available. The client library manages this parallel communication on behalf of the application, hiding messy details such as retries and errors, until all results are returned or an unrecoverable error occurs. Then the client library rolls the responses up into a Map and hands it over to the application. If an unrecoverable error occurs, an exception is thrown for the application code to catch and act on.

[diagram: rpc.png]

Coprocessor Management

After you have a good understanding of how coprocessors work in HBase, you can start to build your own experimental coprocessors, deploy them to your HBase cluster, and observe the new behaviors.

Build Your Own Coprocessor

We now assume you have your coprocessor code ready, compiled, and packaged as a jar file. The following sections show how the coprocessor framework can be configured to load it.

(We should provide a template coprocessor to help users start developing quickly. Currently there are some built-in coprocessors that can serve as examples and starting points for implementing a new coprocessor, but they are scattered over the code base. As discussed in HBASE-5273, some sample coprocessors will be provided under src/example/coprocessor of the HBase source code.)

Coprocessor Deployment

Currently we provide two options for deploying coprocessor extensions: load from configuration, which happens when the master or region servers start up; or load from table attribute, dynamic loading when the table is (re)opened. Because most users will set table attributes by way of the ‘alter’ command of the HBase shell, let’s call this load from shell.

Load from Configuration

When a region is opened, the framework tries to read coprocessor class names supplied as the configuration entries:

  • hbase.coprocessor.region.classes: for RegionObservers and Endpoints
  • hbase.coprocessor.master.classes: for MasterObservers
  • hbase.coprocessor.wal.classes: for WALObservers

Here is an example hbase-site.xml where one RegionObserver is configured for all the HBase tables:

<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
</property>

If there are multiple classes specified for loading, the class names must be comma separated. Then, the framework will try to load all the configured classes using the default class loader. This means the jar file must reside on the server side HBase classpath.
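For example, a site file that loads two region-level classes would list them comma separated (the second class name here, com.example.MyRegionObserver, is hypothetical):

```xml
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation,com.example.MyRegionObserver</value>
</property>
```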

If loaded in this manner, the coprocessors will be active on all regions of all tables. This is the system coprocessor introduced earlier. The first listed coprocessor will be assigned the priority Coprocessor.Priority.SYSTEM. Each subsequent coprocessor in the list will have its priority value incremented by one (which reduces its priority; priorities have the natural sort order of integers).

We have not really discussed priority, but it should be reasonably clear how the priority given to a coprocessor affects how it integrates with other coprocessors. When calling out to registered observers, the framework executes their callback methods in the sorted order of their priority. Ties are broken arbitrarily.
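A minimal sketch of this ordering in plain Java (not HBase code; the base value is only illustrative of Coprocessor.Priority.SYSTEM, and the class names are made up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PrioritySketch {
    static final int PRIORITY_SYSTEM = 536870911;  // illustrative base value

    static class Loaded {
        final String className;
        final int priority;
        Loaded(String className, int priority) {
            this.className = className;
            this.priority = priority;
        }
    }

    public static void main(String[] args) {
        String[] configured = {"AccessController", "Auditor", "Indexer"};
        List<Loaded> loaded = new ArrayList<Loaded>();
        // First listed class gets the base value; each later one gets +1,
        // which gives it a *lower* effective priority.
        for (int i = 0; i < configured.length; i++) {
            loaded.add(new Loaded(configured[i], PRIORITY_SYSTEM + i));
        }
        // Callbacks run in ascending order of priority value.
        Collections.sort(loaded, new Comparator<Loaded>() {
            public int compare(Loaded a, Loaded b) {
                return Integer.compare(a.priority, b.priority);
            }
        });
        for (Loaded l : loaded) {
            System.out.println(l.priority + " " + l.className);
        }
    }
}
```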

Load from Shell

Coprocessors can also be configured to load on a per-table basis, via the shell command 'alter' with the 'table_att' method.

hbase(main):005:0> alter 't1', METHOD => 'table_att',
  'coprocessor'=>'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.0730 seconds

hbase(main):006:0> describe 't1'
DESCRIPTION                                                        ENABLED
 {NAME => 't1', coprocessor$1 => 'hdfs:///foo.jar|com.foo.FooRegio false
 nObserver|1001|arg1=1,arg2=2', FAMILIES => [{NAME => 'c1', DATA_B
 LOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
 => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS =>
 '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZ
 E => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLO
 CKCACHE => 'true'}, {NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE',
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3'
 , COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647'
 , KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY
 => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0190 seconds

The coprocessor framework will try to read the class information from the coprocessor table attribute value. The value contains four pieces of information which are separated by “|’’:

  • File path: The jar file containing the coprocessor implementation must be located somewhere all region servers can read it. The file could be copied onto the local disk of every region server, but we recommend storing it in HDFS instead. If no file path is given, the framework will attempt to load the class from the server classpath using the default class loader.
  • Class name: The full class name of the coprocessor.
  • Priority: An integer. The framework will determine the execution sequence of all configured observers registered at the same hook using priorities. This field can be left blank. In that case the framework will assign a default priority value.
  • Arguments: This field is passed to the coprocessor implementation.
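A quick sketch (not HBase's actual parser) of how such an attribute value splits into its four fields:

```java
public class CoprocessorAttrSketch {
    public static void main(String[] args) {
        String attr = "hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2";
        // Split with limit -1 to keep empty fields (an empty file path means
        // "load from the server classpath").
        String[] parts = attr.split("\\|", -1);
        System.out.println("path     = " + parts[0]);
        System.out.println("class    = " + parts[1]);
        System.out.println("priority = " + parts[2]);
        System.out.println("args     = " + parts[3]);
    }
}
```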

You can also remove a loaded coprocessor from the shell, with the 'alter' + 'table_att_unset' command:

hbase(main):007:0> alter 't1', METHOD => 'table_att_unset',
hbase(main):008:0*   NAME => 'coprocessor$1'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.1130 seconds

hbase(main):009:0> describe 't1'
DESCRIPTION                                                        ENABLED
 {NAME => 't1', FAMILIES => [{NAME => 'c1', DATA_BLOCK_ENCODING => false
 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSION
 S => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '214
 7483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN
 _MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true
 '}, {NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER =>
 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_C
 ELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCO
 DE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0180 seconds

(Note: this is the current behavior in 0.92. There has been some discussion of refactoring how table attributes are handled for coprocessors and other features using the shell’s ‘alter’ command. See HBASE-4879 for more detail.)

In this section, we discussed how to tell HBase which coprocessors are expected to be loaded, but there is no guarantee that the framework will load them successfully. For example, the shell command neither guarantees a jar file exists at a particular location nor verifies if the given class is actually contained in the jar file.

HBase Shell Coprocessor Status

After a coprocessor has been configured, you also need to check the coprocessor status using the shell or master and region server web UIs to determine if the coprocessor has been loaded successfully.

Shell command:

hbase(main):018:0> alter 't1', METHOD => 'table_att',
  'coprocessor'=>'|org.apache.hadoop.hbase.coprocessor.AggregateImplementation|1001|arg1=1,arg2=2'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.1060 seconds

hbase(main):019:0> enable 't1'
0 row(s) in 2.0620 seconds

hbase(main):020:0> status 'detailed'
version 0.92-tm-6
0 regionsInTransition
master coprocessors: []
1 live servers
    localhost:52761 1328082515520
        requestsPerSecond=3, numberOfOnlineRegions=3, usedHeapMB=32, maxHeapMB=995
        -ROOT-,,0
            numberOfStores=1, numberOfStorefiles=1, storefileUncompressedSizeMB=0,
            storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0,
            readRequestsCount=54, writeRequestsCount=1, rootIndexSizeKB=0,
            totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
            currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
        .META.,,1
            numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0,
            storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0,
            readRequestsCount=97, writeRequestsCount=4, rootIndexSizeKB=0,
            totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
            currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
        t1,,1328082575190.c0491168a27620ffe653ec6c04c9b4d1.
            numberOfStores=2, numberOfStorefiles=1, storefileUncompressedSizeMB=0,
            storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0,
            readRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
            totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
            currentCompactedKVs=0, compactionProgressPct=NaN,
            coprocessors=[AggregateImplementation]
0 dead servers

If you cannot find the coprocessor loaded, you need to check the server log files to discover the reason for its failure to load.

Current Status

There are several JIRAs opened for coprocessor development. HBASE-2000 functioned as the umbrella for coprocessor development. HBASE-2001 covered coprocessor framework development. HBASE-2002 covered RPC extensions for endpoints. Code resulting from work on these issues was committed to HBase trunk in 2010, and is available beginning with the 0.92.0 release.

There are some new features developed on top of coprocessors:

  • HBase access control (HBASE-3025, HBASE-3045): This is experimental basic access control for HBase, available in 0.92.0, but unlikely to be fully functional prior to 0.92.1.
  • Column aggregates (HBASE-1512): Providing experimental support for SQL-like sum(), avg(), max(), min(), etc. functions over columns. Available in 0.92.0.
  • Constraints (HBASE-4605): A mechanism for simple restrictions on the domain of a data attribute. For more information read the Constraints package summary: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/constraint/package-summary.html . It is in trunk and will be available in the 0.94 release.

Future Work

Coprocessors are a brand new feature available starting with the 0.92.0 release. While they open many doors, there is much room for improvement. More hooks and interfaces may be added over time to support more use cases, such as:

Parallel Computation Framework

By way of endpoints, we have a new dynamic way to inject user code into the processing of actions on individual table regions, and with the corresponding client side support we can interrogate them all in parallel and return results to the client in a flexible manner. This is immediately useful for building batch data processing and aggregation on top of HBase. However, you need to understand some internal HBase details to develop such applications.

Various JIRAs have been opened that consider exposing an additional framework for parallel computation that can provide a convenient but powerful higher level of abstraction. Options under consideration include MapReduce APIs similar to those provided by Hadoop; support for scriptlets, i.e. Ruby script fragments sent server side to perform work; or something akin to, or directly supporting, the Cascading framework (http://cascading.org) on the server for working with data flows more abstractly.

However, as far as I know, none of these is under active development right now.

External Coprocessor Host (HBASE-4047)

Where HBase coprocessors deviate substantially from the design of Google's BigTable coprocessors is that we have reimagined them as a framework for internal extension. In contrast, BigTable coprocessors run as separate processes colocated with tablet servers. The essential trade-off is between performance, flexibility, and possibility on one hand, and the ability to control and enforce resource usage on the other.

We are considering developing a coprocessor that is a generic host for another coprocessor. The host installs in-process into the master or region servers, but the user code will be loaded into a forked child process. An eventing model and umbilical protocol over a bidirectional pipe between the parent and child will provide the user code loaded into the child the same semantics as if it were loaded internally to the parent. However, we immediately isolate the user code from HBase internals, and eventually expect to use available resource management capabilities on the platform to further limit child resource consumption as desired by system administrators or the application designer.

Code Weaving (HBASE-2058)

Right now there are no constraints on what actions a coprocessor can take. We do not protect against malicious actions or faults accidentally introduced by a coprocessor. As an alternative to external coprocessor hosts we could introduce code weaving and code transformation at coprocessor load time. We would weave in a configurable set of policies for constraining the actions a coprocessor can take. For example:

  • Improve fault isolation and system integrity protections via various static checks
  • Wrap heap allocations to enforce limits
  • Monitor CPU time via instrumentation injected into method and loop headers
  • Reject static or dynamic method invocations to APIs considered unsafe

And More…

The coprocessor framework opens many possibilities for extending HBase. Some further applications that could be built on top of coprocessors have been identified:

  • HBase isolation and allocation (HBASE-4120)
  • Secondary indexing: http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing
  • Search in HBase (HBASE-3529)
  • HBase table and region access statistics
  • and more …