Tag Archives: ACID

[repost ]Paper: Warp: Multi-Key Transactions For Key-Value Stores


Looks like an interesting take on “a completely asynchronous, low-latency transaction management protocol, in line with the fully distributed NoSQL architecture.”

Warp: Multi-Key Transactions for Key-Value Stores overview:

Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that the throughput of this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.

[repost ]Myth: Eric Brewer On Why Banks Are BASE Not ACID – Availability Is Revenue


In NoSQL: Past, Present, Future Eric Brewer has a particularly fine section on explaining the often hard to understand ideas of BASE (Basically Available, Soft State, Eventually Consistent), ACID (Atomicity, Consistency, Isolation, Durability), CAP (Consistency Availability, Partition Tolerance), in terms of a pernicious long standing myth about the sanctity of consistency in banking.

Myth: Money is important, so banks must use transactions to keep money safe and consistent, right?

Reality: Banking transactions are inconsistent, particularly for ATMs. ATMs are designed to have a normal case behaviour and a partition mode behaviour. In partition mode Availability is chosen over Consistency.

Why? 1) Availability correlates with revenue and consistency generally does not. 2) Historically there was never an idea of perfect communication so everything was partitioned.

Your ATM transaction must go through so Availability is more important than consistency. If the ATM is down then you aren’t making money. If you can fudge the consistency and stay up and compensate for other mistakes (which are rare), you’ll make more money. That’s the space most enterprises find themselves so BASE is more popular than it used to be.

This is not a new problem for the financial industry. They’ve never had consistency because historically they’ve never had perfect communication. Instead, the financial industry depends on auditing. What accounts for the consistency of bank data is not the consistency of its databases but the fact that everything is written down twice and sorted out later using a permanent and unalterable record that is reconciled later. The idea of financial compensation for errors is an idea built deeply into the financial system.

During the Renaissance, when the modern banking system started to take shape, everything was partitioned. If letters, your data, are transported by horse or over ships, then it’s likely you data will have a very low consistency, yet they still had an amazingly rich and successful banking system. Transactions were unnecessary.

ATMs, for example, chose commutative operations like increment and decrement, so the order in which the operations are applied doesn’t matter. They are reorderable and can be made consistent later. If an ATM is disconnected from the network and when the partition eventually heals, the ATM sends sends a list of operations to the bank and the end balance will still be correct. The issue is obviously you might withdraw more money than you have so the end result might be consistent, but negative, which can’t be compensated for by asking for the money back, so instead, the bank will reward you with an overdraft penalty.

The hidden philosophy is that you are trying to bound and manage your risk, yet still have all operations available. In the ATM case this would be a limit on the maximum amount of money you can take out at any one time. It’s not that big of a risk. ATMs are profitable so the occasional loss is just the risk of doing business.

Consistency it turns out is not the Holy Grail. What trumps consistency is:


  • Auditing
  • Risk Management
  • Availability


In a post-internet world where write availability is key the real world looks more like weak consistency + delayed exceptions + compensation rather than a mistake free world of perfect communication and transactions. Just like the old days, but now you have far more options on the ACID <—> CAP spectrum.

[repost ]Apache HBase (TM) and ACID


About this Document

Apache HBase (TM) is not an ACID compliant database. However, it does guarantee certain specific properties.

This specification enumerates the ACID properties of HBase.


For the sake of common vocabulary, we define the following terms:

an operation is atomic if it either completes entirely or not at all
all actions cause the table to transition from one valid state directly to another (eg a row will not disappear during an update, etc)
an operation is isolated if it appears to complete independently of any other concurrent transaction
any update that reports “successful” to the client will not be lost
an update is considered visible if any subsequent read will see the update as having been committed

The terms must and may are used as specified by RFC 2119. In short, the word “must” implies that, if some case exists where the statement is not true, it is a bug. The word “may” implies that, even if the guarantee is provided in a current release, users should not rely on it.

APIs to consider

  • Read APIs
    • get
    • scan
  • Write APIs
    • put
    • batch put
    • delete
  • Combination (read-modify-write) APIs
    • incrementColumnValue
    • checkAndPut

Guarantees Provided


  1. All mutations are atomic within a row. Any put will either wholely succeed or wholely fail.[3]
    1. An operation that returns a “success” code has completely succeeded.
    2. An operation that returns a “failure” code has completely failed.
    3. An operation that times out may have succeeded and may have failed. However, it will not have partially succeeded or failed.
  2. This is true even if the mutation crosses multiple column families within a row.
  3. APIs that mutate several rows will _not_ be atomic across the multiple rows. For example, a multiput that operates on rows ‘a’,’b’, and ‘c’ may return having mutated some but not all of the rows. In such cases, these APIs will return a list of success codes, each of which may be succeeded, failed, or timed out as described above.
  4. The checkAndPut API happens atomically like the typical compareAndSet (CAS) operation found in many hardware architectures.
  5. The order of mutations is seen to happen in a well-defined order for each row, with no interleaving. For example, if one writer issues the mutation “a=1,b=1,c=1” and another writer issues the mutation “a=2,b=2,c=2”, the row must either be “a=1,b=1,c=1” or “a=2,b=2,c=2” and must notbe something like “a=1,b=2,c=1”.
    1. Please note that this is not true _across rows_ for multirow batch mutations.

Consistency and Isolation

  1. All rows returned via any access API will consist of a complete row that existed at some point in the table’s history.
  2. This is true across column families – i.e a get of a full row that occurs concurrent with some mutations 1,2,3,4,5 will return a complete row that existed at some point in time between mutation i and i+1 for some i between 1 and 5.
  3. The state of a row will only move forward through the history of edits to it.

Consistency of Scans

A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.

Rather, scans have the following properties:

  1. Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time) [1]
  2. A scan will always reflect a view of the data at least as new asthe beginning of the scan. This satisfies the visibility guarantees enumerated below.
    1. For example, if client A writes data X and then communicates via a side channel to client B, any scans started by client B will contain data at least as new as X.
    2. A scan _must_ reflect all mutations committed prior to the construction of the scanner, and _may_ reflect some mutations committed subsequent to the construction of the scanner.
    3. Scans must include all data written prior to the scan (except in the case where data is subsequently mutated, in which case it _may_ reflect the mutation)

Those familiar with relational databases will recognize this isolation level as “read committed”.

Please note that the guarantees listed above regarding scanner consistency are referring to “transaction commit time”, not the “timestamp” field of each cell. That is to say, a scanner started at time t may see edits with a timestamp value greater than t, if those edits were committed with a “forward dated” timestamp before the scanner was constructed.


  1. When a client receives a “success” response for any mutation, that mutation is immediately visible to both that client and any client with whom it later communicates through side channels. [3]
  2. A row must never exhibit so-called “time-travel” properties. That is to say, if a series of mutations moves a row sequentially through a series of states, any sequence of concurrent reads will return a subsequence of those states.
    1. For example, if a row’s cells are mutated using the “incrementColumnValue” API, a client must never see the value of any cell decrease.
    2. This is true regardless of which read API is used to read back the mutation.
  3. Any version of a cell that has been returned to a read operation is guaranteed to be durably stored.


  1. All visible data is also durable data. That is to say, a read will never return data that has not been made durable on disk[2]
  2. Any operation that returns a “success” code (eg does not throw an exception) will be made durable.[3]
  3. Any operation that returns a “failure” code will not be made durable (subject to the Atomicity guarantees above)
  4. All reasonable failure scenarios will not affect any of the guarantees of this document.


All of the above guarantees must be possible within Apache HBase. For users who would like to trade off some guarantees for performance, HBase may offer several tuning options. For example:

  • Visibility may be tuned on a per-read basis to allow stale reads or time travel.
  • Durability may be tuned to only flush data to disk on a periodic basis

More Information

For more information, see the client architecture or data model sections in the Apache HBase Reference Guide.


[1] A consistent view is not guaranteed intra-row scanning — i.e. fetching a portion of a row in one RPC then going back to fetch another portion of the row in a subsequent RPC. Intra-row scanning happens when you set a limit on how many values to return per Scan#next (See Scan#setBatch(int)).

[2] In the context of Apache HBase, “durably on disk” implies an hflush() call on the transaction log. This does not actually imply an fsync() to magnetic media, but rather just that the data has been written to the OS cache on all replicas of the log. In the case of a full datacenter power loss, it is possible that the edits are not truly durable.

[3] Puts will either wholely succeed or wholely fail, provided that they are actually sent to the RegionServer. If the writebuffer is used, Puts will not be sent until the writebuffer is filled or it is explicitly flushed.

[repost ]architecture:The NoSQL Ecosystem


Chapter 13. The NoSQL Ecosystem

Adam Marcus

Unlike most of the other projects in this book, NoSQL is not a tool, but an ecosystem composed of several complimentary and competing tools. The tools branded with the NoSQL monicker provide an alternative to SQL-based relational database systems for storing data. To understand NoSQL, we have to understand the space of available tools, and see how the design of each one explores the space of data storage possibilities.

If you are considering using a NoSQL storage system, you should first understand the wide space of options that NoSQL systems span. NoSQL systems do away with many of the traditional comforts of relational database systems, and operations which were typically encapsulated behind the system boundary of a database are now left to application designers. This requires you to take on the hat of a systems architect, which requires a more in-depth understanding of how such systems are built.

13.1. What’s in a Name?

In defining the space of NoSQL, let’s first take a stab at defining the name. Taken literally, a NoSQL system presents a query interface to the user that is not SQL. The NoSQL community generally takes a more inclusive view, suggesting that NoSQL systems provide alternatives to traditional relational databases, and allow developers to design projects which use Not Only a SQL interface. In some cases, you might replace a relational database with a NoSQL alternative, and in others you will employ a mix-and-match approach to different problems you encounter in application development.

Before diving into the world of NoSQL, let’s explore the cases where SQL and the relational model suit your needs, and others where a NoSQL system might be a better fit.

13.1.1. SQL and the Relational Model

SQL is a declarative language for querying data. A declarative language is one in which a programmer specifies what they want the system to do, rather than procedurally defining how the system should do it. A few examples include: find the record for employee 39, project out only the employee name and phone number from their entire record, filter employee records to those that work in accounting, count the employees in each department, or join the data from the employees table with the managers table.

To a first approximation, SQL allows you to ask these questions without thinking about how the data is laid out on disk, which indices to use to access the data, or what algorithms to use to process the data. A significant architectural component of most relational databases is a query optimizer, which decides which of the many logically equivalent query plans to execute to most quickly answer a query. These optimizers are often better than the average database user, but sometimes they do not have enough information or have too simple a model of the system in order to generate the most efficient execution.

Relational databases, which are the most common databases used in practice, follow the relational data model. In this model, different real-world entities are stored in different tables. For example, all employees might be stored in an Employees table, and all departments might be stored in a Departments table. Each row of a table has various properties stored in columns. For example, employees might have an employee id, salary, birth date, and first/last names. Each of these properties will be stored in a column of the Employees table.

The relational model goes hand-in-hand with SQL. Simple SQL queries, such as filters, retrieve all records whose field matches some test (e.g., employeeid = 3, or salary > $20000). More complex constructs cause the database to do some extra work, such as joining data from multiple tables (e.g., what is the name of the department in which employee 3 works?). Other complex constructs such as aggregates (e.g., what is the average salary of my employees?) can lead to full-table scans.

The relational data model defines highly structured entities with strict relationships between them. Querying this model with SQL allows complex data traversals without too much custom development. The complexity of such modeling and querying has its limits, though:

  • Complexity leads to unpredictability. SQL’s expressiveness makes it challenging to reason about the cost of each query, and thus the cost of a workload. While simpler query languages might complicate application logic, they make it easier to provision data storage systems, which only respond to simple requests.
  • There are many ways to model a problem. The relational data model is strict: the schema assigned to each table specifies the data in each row. If we are storing less structured data, or rows with more variance in the columns they store, the relational model may be needlessly restrictive. Similarly, application developers might not find the relational model perfect for modeling every kind of data. For example, a lot of application logic is written in object-oriented languages and includes high-level concepts such as lists, queues, and sets, and some programmers would like their persistence layer to model this.
  • If the data grows past the capacity of one server, then the tables in the database will have to be partitioned across computers. To avoid JOINs having to cross the network in order to get data in different tables, we will have to denormalize it. Denormalization stores all of the data from different tables that one might want to look up at once in a single place. This makes our database look like a key-lookup storage system, leaving us wondering what other data models might better suit the data.

It’s generally not wise to discard many years of design considerations arbitrarily. When you consider storing your data in a database, consider SQL and the relational model, which are backed by decades of research and development, offer rich modeling capabilities, and provide easy-to-understand guarantees about complex operations. NoSQL is a good option when you have a specific problem, such as large amounts of data, a massive workload, or a difficult data modeling decision for which SQL and relational databases might not have been optimized.

13.1.2. NoSQL Inspirations

The NoSQL movement finds much of its inspiration in papers from the research community. While many papers are at the core of design decisions in NoSQL systems, two stand out in particular.

Google’s BigTable [CDG+06] presents an interesting data model, which facilitates sorted storage of multi-column historical data. Data is distributed to multiple servers using a hierarchical range-based partitioning scheme, and data is updated with strict consistency (a concept that we will eventually define in Section 13.5).

Amazon’s Dynamo [DHJ+07] uses a different key-oriented distributed datastore. Dynamo’s data model is simpler, mapping keys to application-specific blobs of data. The partitioning model is more resilient to failure, but accomplishes that goal through a looser data consistency approach called eventual consistency.

We will dig into each of these concepts in more detail, but it is important to understand that many of them can be mixed and matched. Some NoSQL systems such as HBase1 sticks closely to the BigTable design. Another NoSQL system named Voldemort2 replicates many of Dynamo’s features. Still other NoSQL projects such as Cassandra3 have taken some features from BigTable (its data model) and others from Dynamo (its partitioning and consistency schemes).

13.1.3. Characteristics and Considerations

NoSQL systems part ways with the hefty SQL standard and offer simpler but piecemeal solutions for architecting storage solutions. These systems were built with the belief that in simplifying how a database operates over data, an architect can better predict the performance of a query. In many NoSQL systems, complex query logic is left to the application, resulting in a data store with more predictable query performance because of the lack of variability in queries

NoSQL systems part with more than just declarative queries over the relational data. Transactional semantics, consistency, and durability are guarantees that organizations such as banks demand of databases. Transactions provide an all-or-nothing guarantee when combining several potentially complex operations into one, such as deducting money from one account and adding the money to another. Consistency ensures that when a value is updated, subsequent queries will see the updated value. Durability guarantees that once a value is updated, it will be written to stable storage (such as a hard drive) and recoverable if the database crashes.

NoSQL systems relax some of these guarantees, a decision which, for many non-banking applications, can provide acceptable and predictable behavior in exchange for improved performance. These relaxations, combined with data model and query language changes, often make it easier to safely partition a database across multiple machines when the data grows beyond a single machine’s capability.

NoSQL systems are still very much in their infancy. The architectural decisions that go into the systems described in this chapter are a testament to the requirements of various users. The biggest challenge in summarizing the architectural features of several open source projects is that each one is a moving target. Keep in mind that the details of individual systems will change. When you pick between NoSQL systems, you can use this chapter to guide your thought process, but not your feature-by-feature product selection.

As you think about NoSQL systems, here is a roadmap of considerations:

  • Data and query model: Is your data represented as rows, objects, data structures, or documents? Can you ask the database to calculate aggregates over multiple records?
  • Durability: When you change a value, does it immediately go to stable storage? Does it get stored on multiple machines in case one crashes?
  • Scalability: Does your data fit on a single server? Do the amount of reads and writes require multiple disks to handle the workload?
  • Partitioning: For scalability, availability, or durability reasons, does the data need to live on multiple servers? How do you know which record is on which server?
  • Consistency: If you’ve partitioned and replicated your records across multiple servers, how do the servers coordinate when a record changes?
  • Transactional semantics: When you run a series of operations, some databases allow you to wrap them in a transaction, which provides some subset of ACID (Atomicity, Consistency, Isolation, and Durability) guarantees on the transaction and all others currently running. Does your business logic require these guarantees, which often come with performance tradeoffs?
  • Single-server performance: If you want to safely store data on disk, what on-disk data structures are best-geared toward read-heavy or write-heavy workloads? Is writing to disk your bottleneck?
  • Analytical workloads: We’re going to pay a lot of attention to lookup-heavy workloads of the kind you need to run a responsive user-focused web application. In many cases, you will want to build dataset-sized reports, aggregating statistics across multiple users for example. Does your use-case and toolchain require such functionality?

While we will touch on all of these consideration, the last three, while equally important, see the least attention in this chapter.

13.2. NoSQL Data and Query Models

The data model of a database specifies how data is logically organized. Its query model dictates how the data can be retrieved and updated. Common data models are the relational model, key-oriented storage model, or various graph models. Query languages you might have heard of include SQL, key lookups, and MapReduce. NoSQL systems combine different data and query models, resulting in different architectural considerations.

13.2.1. Key-based NoSQL Data Models

NoSQL systems often part with the relational model and the full expressivity of SQL by restricting lookups on a dataset to a single field. For example, even if an employee has many properties, you might only be able to retrieve an employee by her ID. As a result, most queries in NoSQL systems are key lookup-based. The programmer selects a key to identify each data item, and can, for the most part, only retrieve items by performing a lookup for their key in the database.

In key lookup-based systems, complex join operations or multiple-key retrieval of the same data might require creative uses of key names. A programmer wishing to look up an employee by his employee ID and to look up all employees in a department might create two key types. For example, the key employee:30 would point to an employee record for employee ID 30, and employee_departments:20 might contain a list of all employees in department 20. A join operation gets pushed into application logic: to retrieve employees in department 20, an application first retrieves a list of employee IDs from key employee_departments:20, and then loops over key lookups for each employee:ID in the employee list.

The key lookup model is beneficial because it means that the database has a consistent query pattern—the entire workload consists of key lookups whose performance is relatively uniform and predictable. Profiling to find the slow parts of an application is simpler, since all complex operations reside in the application code. On the flip side, the data model logic and business logic are now more closely intertwined, which muddles abstraction.

Let’s quickly touch on the data associated with each key. Various NoSQL systems offer different solutions in this space.

Key-Value Stores

The simplest form of NoSQL store is a key-value store. Each key is mapped to a value containing arbitrary data. The NoSQL store has no knowledge of the contents of its payload, and simply delivers the data to the application. In our Employee database example, one might map the key employee:30 to a blob containing JSON or a binary format such as Protocol Buffers4, Thrift5, or Avro6 in order to encapsulate the information about employee 30.

If a developer uses structured formats to store complex data for a key, she must operate against the data in application space: a key-value data store generally offers no mechanisms for querying for keys based on some property of their values. Key-value stores shine in the simplicity of their query model, usually consisting of set, get, and delete primitives, but discard the ability to add simple in-database filtering capabilities due to the opacity of their values. Voldemort, which is based on Amazon’s Dynamo, provides a distributed key-value store. BDB7 offers a persistence library that has a key-value interface.

Key-Data Structure Stores

Key-data structure stores, made popular by Redis8, assign each value a type. In Redis, the available types a value can take on are integer, string, list, set, and sorted set. In addition to set/get/delete, type-specific commands, such as increment/decrement for integers, or push/pop for lists, add functionality to the query model without drastically affecting performance characteristics of requests. By providing simple type-specific functionality while avoiding multi-key operations such as aggregation or joins, Redis balances functionality and performance.

Key-Document Stores

Key-document stores, such as CouchDB9, MongoDB10, and Riak11, map a key to some document that contains structured information. These systems store documents in a JSON or JSON-like format. They store lists and dictionaries, which can be embedded recursively inside one-another.

MongoDB separates the keyspace into collections, so that keys for Employees and Department, for example, do not collide. CouchDB and Riak leave type-tracking to the developer. The freedom and complexity of document stores is a double-edged sword: application developers have a lot of freedom in modeling their documents, but application-based query logic can become exceedingly complex.

BigTable Column Family Stores

HBase and Cassandra base their data model on the one used by Google’s BigTable. In this model, a key identifies a row, which contains data stored in one or more Column Families (CFs). Within a CF, each row can contain multiple columns. The values within each column are timestamped, so that several versions of a row-column mapping can live within a CF.

Conceptually, one can think of Column Families as storing complex keys of the form (row ID, CF, column, timestamp), mapping to values which are sorted by their keys. This design results in data modeling decisions which push a lot of functionality into the keyspace. It is particularly good at modeling historical data with timestamps. The model naturally supports sparse column placement since row IDs that do not have certain columns do not need an explicit NULL value for those columns. On the flip side, columns which have few or no NULL values must still store the column identifier with each row, which leads to greater space consumption.

Each project data model differs from the original BigTable model in various ways, but Cassandra’s changes are most notable. Cassandra introduces the notion of a supercolumn within each CF to allow for another level of mapping, modeling, and indexing. It also does away with a notion of locality groups, which can physically store multiple column families together for performance reasons.

13.2.2. Graph Storage

One class of NoSQL stores are graph stores. Not all data is created equal, and the relational and key-oriented data models of storing and querying data are not the best for all data. Graphs are a fundamental data structure in computer science, and systems such as HyperGraphDB12 and Neo4J13 are two popular NoSQL storage systems for storing graph-structured data. Graph stores differ from the other stores we have discussed thus far in almost every way: data models, data traversal and querying patterns, physical layout of data on disk, distribution to multiple machines, and the transactional semantics of queries. We can not do these stark differences justice given space limitations, but you should be aware that certain classes of data may be better stored and queried as a graph.

13.2.3. Complex Queries

There are notable exceptions to key-only lookups in NoSQL systems. MongoDB allows you to index your data based on any number of properties and has a relatively high-level language for specifying which data you want to retrieve. BigTable-based systems support scanners to iterate over a column family and select particular items by a filter on a column. CouchDB allows you to create different views of the data, and to run MapReduce tasks across your table to facilitate more complex lookups and updates. Most of the systems have bindings to Hadoop or another MapReduce framework to perform dataset-scale analytical queries.

13.2.4. Transactions

NoSQL systems generally prioritize performance over transactional semantics. Other SQL-based systems allow any set of statements—from a simple primary key row retrieval, to a complicated join between several tables which is then subsequently averaged across several fields—to be placed in a transaction.

These SQL databases will offer ACID guarantees between transactions. Running multiple operations in a transaction is Atomic (the A in ACID), meaning all or none of the operations happen. Consistency (the C) ensures that the transaction leaves the database in a consistent, uncorrupted state. Isolation (the I) makes sure that if two transactions touch the same record, they will do without stepping on each other’s feet. Durability (the D, covered extensively in the next section), ensures that once a transaction is committed, it’s stored in a safe place.

ACID-compliant transactions keep developers sane by making it easy to reason about the state of their data. Imagine multiple transactions, each of which has multiple steps (e.g., first check the value of a bank account, then subtract $60, then update the value). ACID-compliant databases often are limited in how they can interleave these steps while still providing a correct result across all transactions. This push for correctness results in often-unexpected performance characteristics, where a slow transaction might cause an otherwise quick one to wait in line.

Most NoSQL systems pick performance over full ACID guarantees, but do provide guarantees at the key level: two operations on the same key will be serialized, avoiding serious corruption to key-value pairs. For many applications, this decision will not pose noticeable correctness issues, and will allow quick operations to execute with more regularity. It does, however, leave more considerations for application design and correctness in the hands of the developer.

Redis is the notable exception to the no-transaction trend. On a single server, it provides a MULTI command to combine multiple operations atomically and consistently, and a WATCH command to allow isolation. Other systems provide lower-level test-and-set functionality which provides some isolation guarantees.

13.2.5. Schema-free Storage

A cross-cutting property of many NoSQL systems is the lack of schema enforcement in the database. Even in document stores and column family-oriented stores, properties across similar entities are not required to be the same. This has the benefit of supporting less structured data requirements and requiring less performance expense when modifying schemas on-the-fly. The decision leaves more responsibility to the application developer, who now has to program more defensively. For example, is the lack of a lastname property on an employee record an error to be rectified, or a schema update which is currently propagating through the system? Data and schema versioning is common in application-level code after a few iterations of a project which relies on sloppy-schema NoSQL systems.

13.3. Data Durability

Ideally, all data modifications on a storage system would immediately be safely persisted and replicated to multiple locations to avoid data loss. However, ensuring data safety is in tension with performance, and different NoSQL systems make different data durability guarantees in order to improve performance. Failure scenarios are varied and numerous, and not all NoSQL systems protect you against these issues.

A simple and common failure scenario is a server restart or power loss. Data durability in this case involves having moved the data from memory to a hard disk, which does not require power to store data. Hard disk failure is handled by copying the data to secondary devices, be they other hard drives in the same machine (RAID mirroring) or other machines on the network. However, a data center might not survive an event which causes correlated failure (a tornado, for example), and some organizations go so far as to copy data to backups in data centers several hurricane widths apart. Writing to hard drives and copying data to multiple servers or data centers is expensive, so different NoSQL systems trade off durability guarantees for performance.

13.3.1. Single-server Durability

The simplest form of durability is a single-server durability, which ensures that any data modification will survive a server restart or power loss. This usually means writing the changed data to disk, which often bottlenecks your workload. Even if you order your operating system to write data to an on-disk file, the operating system may buffer the write, avoiding an immediate modification on disk so that it can group several writes together into a single operation. Only when the fsync system call is issued does the operating system make a best-effort attempt to ensure that buffered updates are persisted to disk.

Typical hard drives can perform 100-200 random accesses (seeks) per second, and are limited to 30-100 MB/sec of sequential writes. Memory can be orders of magnitudes faster in both scenarios. Ensuring efficient single-server durability means limiting the number of random writes your system incurs, and increasing the number of sequential writes per hard drive. Ideally, you want a system to minimize the number of writes between fsync calls, maximizing the number of those writes that are sequential, all the while never telling the user their data has been successfully written to disk until that write has been fsynced. Let’s cover a few techniques for improving performance of single-server durability guarantees.

Control fsync Frequency

Memcached14 is an example of a system which offers no on-disk durability in exchange for extremely fast in-memory operations. When a server restarts, the data on that server is gone: this makes for a good cache and a poor durable data store.

Redis offers developers several options for when to call fsync. Developers can force an fsync call after every update, which is the slow and safe choice. For better performance, Redis can fsync its writes every N seconds. In a worst-case scenario, the you will lose last N seconds worth of operations, which may be acceptable for certain uses. Finally, for use cases where durability is not important (maintaining coarse-grained statistics, or using Redis as a cache), the developer can turn off fsync calls entirely: the operating system will eventually flush the data to disk, but without guarantees of when this will happen.

Increase Sequential Writes by Logging

Several data structures, such as B+Trees, help NoSQL systems quickly retrieve data from disk. Updates to those structures result in updates in random locations in the data structures’ files, resulting in several random writes per update if you fsync after each update. To reduce random writes, systems such as Cassandra, HBase, Redis, and Riak append update operations to a sequentially-written file called a log. While other data structures used by the system are only periodically fsynced, the log is frequently fsynced. By treating the log as the ground-truth state of the database after a crash, these storage engines are able to turn random updates into sequential ones.

While NoSQL systems such as MongoDB perform writes in-place in their data structures, others take logging even further. Cassandra and HBase use a technique borrowed from BigTable of combining their logs and lookup data structures into one log-structured merge tree. Riak provides similar functionality with a log-structured hash table. CouchDB has modified the traditional B+Tree so that all changes to the data structure are appended to the structure on physical storage. These techniques result in improved write throughput, but require a periodic log compaction to keep the log from growing unbounded.

Increase Throughput by Grouping Writes

Cassandra groups multiple concurrent updates within a short window into a single fsync call. This design, called group commit, results in higher latency per update, as users have to wait on several concurrent updates to have their own update be acknowledged. The latency bump comes at an increase in throughput, as multiple log appends can happen with a single fsync. As of this writing, every HBase update is persisted to the underlying storage provided by the Hadoop Distributed File System (HDFS)15, which has recently seen patches to allow support of appends that respect fsync and group commit.

13.3.2. Multi-server Durability

Because hard drives and machines often irreparably fail, copying important data across machines is necessary. Many NoSQL systems offer multi-server durability for data.

Redis takes a traditional master-slave approach to replicating data. All operations executed against a master are communicated in a log-like fashion to slave machines, which replicate the operations on their own hardware. If a master fails, a slave can step in and serve the data from the state of the operation log that it received from the master. This configuration might result in some data loss, as the master does not confirm that the slave has persisted an operation in its log before acknowledging the operation to the user. CouchDB facilitates a similar form of directional replication, where servers can be configured to replicate changes to documents on other stores.

MongoDB provides the notion of replica sets, where some number of servers are responsible for storing each document. MongoDB gives developers the option of ensuring that all replicas have received updates, or to proceed without ensuring that replicas have the most recent data. Many of the other distributed NoSQL storage systems support multi-server replication of data. HBase, which is built on top of HDFS, receives multi-server durability through HDFS. All writes are replicated to two or more HDFS nodes before returning control to the user, ensuring multi-server durability.

Riak, Cassandra, and Voldemort support more configurable forms of replication. With subtle differences, all three systems allow the user to specify N, the number of machines which should ultimately have a copy of the data, and W<N, the number of machines that should confirm the data has been written before returning control to the user.

To handle cases where an entire data center goes out of service, multi-server replication across data centers is required. Cassandra, HBase, and Voldemort have rack-aware configurations, which specify the rack or data center in which various machines are located. In general, blocking the user’s request until a remote server has acknowledged an update incurs too much latency. Updates are streamed without confirmation when performed across wide area networks to backup data centers.

13.4. Scaling for Performance

Having just spoken about handling failure, let’s imagine a rosier situation: success! If the system you build reaches success, your data store will be one of the components to feel stress under load. A cheap and dirty solution to such problems is to scale up your existing machinery: invest in more RAM and disks to handle the workload on one machine. With more success, pouring money into more expensive hardware will become infeasible. At this point, you will have to replicate data and spread requests across multiple machines to distribute load. This approach is called scale out, and is measured by the horizontal scalability of your system.

The ideal horizontal scalability goal is linear scalability, in which doubling the number of machines in your storage system doubles the query capacity of the system. The key to such scalability is in how the data is spread across machines. Sharding is the act of splitting your read and write workload across multiple machines to scale out your storage system. Sharding is fundamental to the design of many systems, namely Cassandra, HBase, Voldemort, and Riak, and more recently MongoDB and Redis. Some projects such as CouchDB focus on single-server performance and do not provide an in-system solution to sharding, but secondary projects provide coordinators to partition the workload across independent installations on multiple machines.

Let’s cover a few interchangeable terms you might encounter. We will use the terms sharding and partitioning interchangeably. The terms machine, server, or node refer to some physical computer which stores part of the partitioned data. Finally, a cluster or ring refers to the set of machines which participate in your storage system.

Sharding means that no one machine has to handle the write workload on the entire dataset, but no one machine can answer queries about the entire dataset. Most NoSQL systems are key-oriented in both their data and query models, and few queries touch the entire dataset anyway. Because the primary access method for data in these systems is key-based, sharding is typically key-based as well: some function of the key determines the machine on which a key-value pair is stored. We’ll cover two methods of defining the key-machine mapping: hash partitioning and range partitioning.

13.4.1. Do Not Shard Until You Have To

Sharding adds system complexity, and where possible, you should avoid it. Let’s cover two ways to scale without sharding: read replicas and caching.

Read Replicas

Many storage systems see more read requests than write requests. A simple solution in these cases is to make copies of the data on multiple machines. All write requests still go to a master node. Read requests go to machines which replicate the data, and are often slightly stale with respect to the data on the write master.

If you are already replicating your data for multi-server durability in a master-slave configuration, as is common in Redis, CouchDB, or MongoDB, the read slaves can shed some load from the write master. Some queries, such as aggregate summaries of your dataset, which might be expensive and often do not require up-to-the-second freshness, can be executed against the slave replicas. Generally, the less stringent your demands for freshness of content, the more you can lean on read slaves to improve read-only query performance.


Caching the most popular content in your system often works surprisingly well. Memcached dedicates blocks of memory on multiple servers to cache data from your data store. Memcached clients take advantage of several horizontal scalability tricks to distribute load across Memcached installations on different servers. To add memory to the cache pool, just add another Memcached host.

Because Memcached is designed for caching, it does not have as much architectural complexity as the persistent solutions for scaling workloads. Before considering more complicated solutions, think about whether caching can solve your scalability woes. Caching is not solely a temporary band-aid: Facebook has Memcached installations in the range of tens of terabytes of memory!

Read replicas and caching allow you to scale up your read-heavy workloads. When you start to increase the frequency of writes and updates to your data, however, you will also increase the load on the master server that contains all of your up-to-date data. For the rest of this section, we will cover techniques for sharding your write workload across multiple servers.

13.4.2. Sharding Through Coordinators

The CouchDB project focuses on the single-server experience. Two projects, Lounge and BigCouch, facilitate sharding CouchDB workloads through an external proxy, which acts as a front end to standalone CouchDB instances. In this design, the standalone installations are not aware of each other. The coordinator distributes requests to individual CouchDB instances based on the key of the document being requested.

Twitter has built the notions of sharding and replication into a coordinating framework called Gizzard16. Gizzard takes standalone data stores of any type—you can build wrappers for SQL or NoSQL storage systems—and arranges them in trees of any depth to partition keys by key range. For fault tolerance, Gizzard can be configured to replicate data to multiple physical machines for the same key range.

13.4.3. Consistent Hash Rings

Good hash functions distribute a set of keys in a uniform manner. This makes them a powerful tool for distributing key-value pairs among multiple servers. The academic literature on a technique called consistent hashing is extensive, and the first applications of the technique to data stores was in systems called distributed hash tables (DHTs). NoSQL systems built around the principles of Amazon’s Dynamo adopted this distribution technique, and it appears in Cassandra, Voldemort, and Riak.

Hash Rings by Example

[A Distributed Hash Table Ring]Figure 13.1: A Distributed Hash Table Ring

Consistent hash rings work as follows. Say we have a hash function H that maps keys to uniformly distributed large integer values. We can form a ring of numbers in the range [1, L] that wraps around itself with these values by taking H(key) mod L for some relatively large integer L. This will map each key into the range [1,L]. A consistent hash ring of servers is formed by taking each server’s unique identifier (say its IP address), and applying H to it. You can get an intuition for how this works by looking at the hash ring formed by five servers (AE) in Figure 13.1.

There, we picked L = 1000. Let’s say that H(A) mod L = 7, H(B) mod L = 234, H(C) mod L = 447, H(D) mod L = 660, and H(E) mod L = 875. We can now tell which server a key should live on. To do this, we map all keys to a server by seeing if it falls in the range between that server and the next one in the ring. For example, A is responsible for keys whose hash value falls in the range [7,233], and E is responsible for keys in the range [875, 6] (this range wraps around on itself at 1000). So if H('employee30') mod L = 899, it will be stored by server E, and if H('employee31') mod L = 234, it will be stored on server B.

Replicating Data

Replication for multi-server durability is achieved by passing the keys and values in one server’s assigned range to the servers following it in the ring. For example, with a replication factor of 3, keys mapped to the range [7,233] will be stored on servers A, B, and C. If A were to fail, its neighbors B and C would take over its workload. In some designs, E would replicate and take over A‘s workload temporarily, since its range would expand to include A‘s.

Achieving Better Distribution

While hashing is statistically effective at uniformly distributing a keyspace, it usually requires many servers before it distributes evenly. Unfortunately, we often start with a small number of servers that are not perfectly spaced apart from one-another by the hash function. In our example, A‘s key range is of length 227, whereas E‘s range is 132. This leads to uneven load on different servers. It also makes it difficult for servers to take over for one-another when they fail, since a neighbor suddenly has to take control of the entire range of the failed server.

To solve the problem of uneven large key ranges, many DHTs including Riak create several `virtual’ nodes per physical machine. For example, with 4 virtual nodes, server A will act as server A_1, A_2, A_3, and A_4. Each virtual node hashes to a different value, giving it more opportunity to manage keys distributed to different parts of the keyspace. Voldemort takes a similar approach, in which the number of partitions is manually configured and usually larger than the number of servers, resulting in each server receiving a number of smaller partitions.

Cassandra does not assign multiple small partitions to each server, resulting in sometimes uneven key range distributions. For load-balancing, Cassandra has an asynchronous process which adjusts the location of servers on the ring depending on their historic load.

13.4.4. Range Partitioning

In the range partitioning approach to sharding, some machines in your system keep metadata about which servers contain which key ranges. This metadata is consulted to route key and range lookups to the appropriate servers. Like the consistent hash ring approach, this range partitioning splits the keyspace into ranges, with each key range being managed by one machine and potentially replicated to others. Unlike the consistent hashing approach, two keys that are next to each other in the key’s sort order are likely to appear in the same partition. This reduces the size of the routing metadata, as large ranges are compressed to [start, end] markers.

In adding active record-keeping of the range-to-server mapping, the range partitioning approach allows for more fine-grained control of load-shedding from heavily loaded servers. If a specific key range sees higher traffic than other ranges, a load manager can reduce the size of the range on that server, or reduce the number of shards that this server serves. The added freedom to actively manage load comes at the expense of extra architectural components which monitor and route shards.

The BigTable Way

Google’s BigTable paper describes a range-partitioning hierarchical technique for sharding data into tablets. A tablet stores a range of row keys and values within a column family. It maintains all of the necessary logs and data structures to answer queries about the keys in its assigned range. Tablet servers serve multiple tablets depending on the load each tablet is experiencing.

Each tablet is kept at a size of 100-200 MB. As tablets change in size, two small tablets with adjoining key ranges might be combined, or a large tablet might be split in two. A master server analyzes tablet size, load, and tablet server availability. The master adjusts which tablet server serves which tablets at any time.

[BigTable-based Range Partitioning]Figure 13.2: BigTable-based Range Partitioning

The master server maintains the tablet assignment in a metadata table. Because this metadata can get large, the metadata table is also sharded into tablets that map key ranges to tablets and tablet servers responsible for those ranges. This results in a three-layer hierarchy traversal for clients to find a key on its hosting tablet server, as depicted in Figure 13.2.

Let’s look at an example. A client searching for key 900 will query server A, which stores the tablet for metadata level 0. This tablet identifies the metadata level 1 tablet on server 6 containing key ranges 500-1500. The client sends a request to server B with this key, which responds that the tablet containing keys 850-950 is found on a tablet on server C. Finally, the client sends the key request to server C, and gets the row data back for its query. Metadata tablets at level 0 and 1 may be cached by the client, which avoids putting undue load on their tablet servers from repeat queries. The BigTable paper explains that this 3-level hierarchy can accommodate 261 bytes worth of storage using 128MB tablets.

Handling Failures

The master is a single point of failure in the BigTable design, but can go down temporarily without affecting requests to tablet servers. If a tablet server fails while serving tablet requests, it is up to the master to recognize this and re-assign its tablets while requests temporarily fail.

In order to recognize and handle machine failures, the BigTable paper describes the use of Chubby, a distributed locking system for managing server membership and liveness. ZooKeeper17 is the open source implementation of Chubby, and several Hadoop-based projects utilize it to manage secondary master servers and tablet server reassignment.

Range Partitioning-based NoSQL Projects

HBase employs BigTable’s hierarchical approach to range-partitioning. Underlying tablet data is stored in Hadoop’s distributed filesystem (HDFS). HDFS handles data replication and consistency among replicas, leaving tablet servers to handle requests, update storage structures, and initiate tablet splits and compactions.

MongoDB handles range partitioning in a manner similar to that of BigTable. Several configuration nodes store and manage the routing tables that specify which storage node is responsible for which key ranges. These configuration nodes stay in sync through a protocol called two-phase commit, and serve as a hybrid of BigTable’s master for specifying ranges and Chubby for highly available configuration management. Separate routing processes, which are stateless, keep track of the most recent routing configuration and route key requests to the appropriate storage nodes. Storage nodes are arranged in replica sets to handle replication.

Cassandra provides an order-preserving partitioner if you wish to allow fast range scans over your data. Cassandra nodes are still arranged in a ring using consistent hashing, but rather than hashing a key-value pair onto the ring to determine the server to which it should be assigned, the key is simply mapped onto the server which controls the range in which the key naturally fits. For example, keys 20 and 21 would both be mapped to server A in our consistent hash ring in Figure 13.1, rather than being hashed and randomly distributed in the ring.

Twitter’s Gizzard framework for managing partitioned and replicated data across many back ends uses range partitioning to shard data. Routing servers form hierarchies of any depth, assigning ranges of keys to servers below them in the hierarchy. These servers either store data for keys in their assigned range, or route to yet another layer of routing servers. Replication in this model is achieved by sending updates to multiple machines for a key range. Gizzard routing nodes manage failed writes in different manner than other NoSQL systems. Gizzard requires that system designers make all updates idempotent (they can be run twice). When a storage node fails, routing nodes cache and repeatedly send updates to the node until the update is confirmed.

13.4.5. Which Partitioning Scheme to Use

Given the hash- and range-based approaches to sharding, which is preferable? It depends. Range partitioning is the obvious choice to use when you will frequently be performing range scans over the keys of your data. As you read values in order by key, you will not jump to random nodes in the network, which would incur heavy network overhead. But if you do not require range scans, which sharding scheme should you use?

Hash partitioning gives reasonable distribution of data across nodes, and random skew can be reduced with virtual nodes. Routing is simple in the hash partitioning scheme: for the most part, the hash function can be executed by clients to find the appropriate server. With more complicated rebalancing schemes, finding the right node for a key becomes more difficult.

Range partitioning requires the upfront cost of maintaining routing and configuration nodes, which can see heavy load and become central points of failure in the absence of relatively complex fault tolerance schemes. Done well, however, range-partitioned data can be load-balanced in small chunks which can be reassigned in high-load situations. If a server goes down, its assigned ranges can be distributed to many servers, rather than loading the server’s immediate neighbors during downtime.

13.5. Consistency

Having spoken about the virtues of replicating data to multiple machines for durability and spreading load, it’s time to let you in on a secret: keeping replicas of your data on multiple machines consistent with one-another is hard. In practice, replicas will crash and get out of sync, replicas will crash and never come back, networks will partition two sets of replicas, and messages between machines will get delayed or lost. There are two major approaches to data consistency in the NoSQL ecosystem. The first is strong consistency, where all replicas remain in sync. The second is eventual consistency, where replicas are allowed to get out of sync, but eventually catch up with one-another. Let’s first get into why the second option is an appropriate consideration by understanding a fundamental property of distributed computing. After that, we’ll jump into the details of each approach.

13.5.1. A Little Bit About CAP

Why are we considering anything short of strong consistency guarantees over our data? It all comes down to a property of distributed systems architected for modern networking equipment. The idea was first proposed by Eric Brewer as the CAP Theorem, and later proved by Gilbert and Lynch [GL02]. The theorem first presents three properties of distributed systems which make up the acronym CAP:

  • Consistency: do all replicas of a piece of data always logically agree on the same version of that data by the time you read it? (This concept of consistency is different than the C in ACID.)
  • Availability: Do replicas respond to read and write requests regardless of how many replicas are inaccessible?
  • Partition tolerance: Can the system continue to operate even if some replicas temporarily lose the ability to communicate with each other over the network?

The theorem then goes on to say that a storage system which operates on multiple computers can only achieve two of these properties at the expense of a third. Also, we are forced to implement partition-tolerant systems. On current networking hardware using current messaging protocols, packets can be lost, switches can fail, and there is no way to know whether the network is down or the server you are trying to send a message to is unavailable. All NoSQL systems should be partition-tolerant. The remaining choice is between consistency and availability. No NoSQL system can provide both at the same time.

Opting for consistency means that your replicated data will not be out of sync across replicas. An easy way to achieve consistency is to require that all replicas acknowledge updates. If a replica goes down and you can not confirm data updates on it, then you degrade availability on its keys. This means that until all replicas recover and respond, the user can not receive successful acknowledgment of their update operation. Thus, opting for consistency is opting for a lack of round-the-clock availability for each data item.

Opting for availability means that when a user issues an operation, replicas should act on the data they have, regardless of the state of other replicas. This may lead to diverging consistency of data across replicas, since they weren’t required to acknowledge all updates, and some replicas may have not noted all updates.

The implications of the CAP theorem lead to the strong consistency and eventual consistency approaches to building NoSQL data stores. Other approaches exist, such as the relaxed consistency and relaxed availability approach presented in Yahoo!’s PNUTS [CRS+08] system. None of the open source NoSQL systems we discuss has adopted this technique yet, so we will not discuss it further.

13.5.2. Strong Consistency

Systems which promote strong consistency ensure that the replicas of a data item will always be able to come to consensus on the value of a key. Some replicas may be out of sync with one-another, but when the user asks for the value of employee30:salary, the machines have a way to consistently agree on the value the user sees. How this works is best explained with numbers.

Say we replicate a key on N machines. Some machine, perhaps one of the N, serves as a coordinator for each user request. The coordinator ensures that a certain number of the N machines has received and acknowledged each request. When a write or update occurs to a key, the coordinator does not confirm with the user that the write occurred until W replicas confirm that they have received the update. When a user wants to read the value for some key, the coordinator responds when at least R have responded with the same value. We say that the system exemplifies strong consistency if R+W>N.

Putting some numbers to this idea, let’s say that we’re replicating each key across N=3 machines (call them A, B, and C). Say that the key employee30:salary is initially set to the value $20,000, but we want to give employee30 a raise to $30,000. Let’s require that at least W=2 of A, B, or C acknowledge each write request for a key. When A and B confirm the write request for (employee30:salary, $30,000), the coordinator lets the user know that employee30:salary is safely updated. Let’s assume that machine C never received the write request for employee30:salary, so it still has the value $20,000. When a coordinator gets a read request for key employee30:salary, it will send that request to all 3 machines:

  • If we set R=1, and machine C responds first with $20,000, our employee will not be very happy.
  • However, if we set R=2, the coordinator will see the value from C, wait for a second response from A or B, which will conflict with C‘s outdated value, and finally receive a response from the third machine, which will confirm that $30,000 is the majority opinion.

So in order to achieve strong consistency in this case, we need to set R=2} so that R+W3}.

What happens when W replicas do not respond to a write request, or R replicas do not respond to a read request with a consistent response? The coordinator can timeout eventually and send the user an error, or wait until the situation corrects itself. Either way, the system is considered unavailable for that request for at least some time.

Your choice of R and W affect how many machines can act strangely before your system becomes unavailable for different actions on a key. If you force all of your replicas to acknowledge writes, for example, then W=N, and write operations will hang or fail on any replica failure. A common choice is R + W = N + 1, the minimum required for strong consistency while still allowing for temporary disagreement between replicas. Many strong consistency systems opt for W=N and R=1, since they then do not have to design for nodes going out of sync.

HBase bases its replicated storage on HDFS, a distributed storage layer. HDFS provides strong consistency guarantees. In HDFS, a write cannot succeed until it has been replicated to all N (usually 2 or 3) replicas, so W = N. A read will be satisfied by a single replica, so R = 1. To avoid bogging down write-intensive workloads, data is transferred from the user to the replicas asynchronously in parallel. Once all replicas acknowledge that they have received copies of the data, the final step of swapping the new data in to the system is performed atomically and consistently across all replicas.

13.5.3. Eventual Consistency

Dynamo-based systems, which include Voldemort, Cassandra, and Riak, allow the user to specify N, R, and W to their needs, even if R + W <= N. This means that the user can achieve either strong or eventual consistency. When a user picks eventual consistency, and even when the programmer opts for strong consistency but W is less than N, there are periods in which replicas might not see eye-to-eye. To provide eventual consistency among replicas, these systems employ various tools to catch stale replicas up to speed. Let’s first cover how various systems determine that data has gotten out of sync, then discuss how they synchronize replicas, and finally bring in a few dynamo-inspired methods for speeding up the synchronization process.

Versioning and Conflicts

Because two replicas might see two different versions of a value for some key, data versioning and conflict detection is important. The dynamo-based systems use a type of versioning called vector clocks. A vector clock is a vector assigned to each key which contains a counter for each replica. For example, if servers A, B, and C are the three replicas of some key, the vector clock will have three entries, (N_A, N_B, N_C), initialized to (0,0,0).

Each time a replica modifies a key, it increments its counter in the vector. If B modifies a key that previously had version (39, 1, 5), it will change the vector clock to (39, 2, 5). When another replica, say C, receives an update from B about the key’s data, it will compare the vector clock from B to its own. As long as its own vector clock counters are all less than the ones delivered from B, then it has a stale version and can overwrite its own copy with B‘s. If B and C have clocks in which some counters are greater than others in both clocks, say (39, 2, 5) and (39, 1, 6), then the servers recognize that they received different, potentially unreconcilable updates over time, and identify a conflict.

Conflict Resolution

Conflict resolution varies across the different systems. The Dynamo paper leaves conflict resolution to the application using the storage system. Two versions of a shopping cart can be merged into one without significant loss of data, but two versions of a collaboratively edited document might require human reviewer to resolve conflict. Voldemort follows this model, returning multiple copies of a key to the requesting client application upon conflict.

Cassandra, which stores a timestamp on each key, uses the most recently timestamped version of a key when two versions are in conflict. This removes the need for a round-trip to the client and simplifies the API. This design makes it difficult to handle situations where conflicted data can be intelligently merged, as in our shopping cart example, or when implementing distributed counters. Riak allows both of the approaches offered by Voldemort and Cassandra. CouchDB provides a hybrid: it identifies a conflict and allows users to query for conflicted keys for manual repair, but deterministically picks a version to return to users until conflicts are repaired.

Read Repair

If R replicas return non-conflicting data to a coordinator, the coordinator can safely return the non-conflicting data to the application. The coordinator may still notice that some of the replicas are out of sync. The Dynamo paper suggests, and Cassandra, Riak, and Voldemort implement, a technique called read repair for handling such situations. When a coordinator identifies a conflict on read, even if a consistent value has been returned to the user, the coordinator starts conflict-resolution protocols between conflicted replicas. This proactively fixes conflicts with little additional work. Replicas have already sent their version of the data to the coordinator, and faster conflict resolution will result in less divergence in the system.

Hinted Handoff

Cassandra, Riak, and Voldemort all employ a technique called hinted handoff to improve write performance for situations where a node temporarily becomes unavailable. If one of the replicas for a key does not respond to a write request, another node is selected to temporarily take over its write workload. Writes for the unavailable node are kept separately, and when the backup node notices the previously unavailable node become available, it forwards all of the writes to the newly available replica. The Dynamo paper utilizes a ‘sloppy quorum’ approach and allows the writes accomplished through hinted handoff to count toward the W required write acknowledgments. Cassandra and Voldemort will not count a hinted handoff against W, and will fail a write which does not have W confirmations from the originally assigned replicas. Hinted handoff is still useful in these systems, as it speeds up recovery when an unavailable node returns.


When a replica is down for an extended period of time, or the machine storing hinted handoffs for an unavailable replica goes down as well, replicas must synchronize from one-another. In this case, Cassandra and Riak implement a Dynamo-inspired process called anti-entropy. In anti-entropy, replicas exchange Merkle Trees to identify parts of their replicated key ranges which are out of sync. A Merkle tree is a hierarchical hash verification: if the hash over the entire keyspace is not the same between two replicas, they will exchange hashes of smaller and smaller portions of the replicated keyspace until the out-of-sync keys are identified. This approach reduces unnecessary data transfer between replicas which contain mostly similar data.


Finally, as distributed systems grow, it is hard to keep track of how each node in the system is doing. The three Dynamo-based systems employ an age-old high school technique known as gossip to keep track of other nodes. Periodically (every second or so), a node will pick a random node it once communicated with to exchange knowledge of the health of the other nodes in the system. In providing this exchange, nodes learn which other nodes are down, and know where to route clients in search of a key.

13.6. A Final Word

The NoSQL ecosystem is still in its infancy, and many of the systems we’ve discussed will change architectures, designs, and interfaces. The important takeaways in this chapter are not what each NoSQL system currently does, but rather the design decisions that led to a combination of features that make up these systems. NoSQL leaves a lot of design work in the hands of the application designer. Understanding the architectural components of these systems will not only help you build the next great NoSQL amalgamation, but also allow you to use current versions responsibly.

13.7. Acknowledgments

I am grateful to Jackie Carter, Mihir Kedia, and the anonymous reviewers for their comments and suggestions to improve the chapter. This chapter would also not be possible without the years of dedicated work of the NoSQL community. Keep building!


  1. http://hbase.apache.org/
  2. http://project-voldemort.com/
  3. http://cassandra.apache.org/
  4. http://code.google.com/p/protobuf/
  5. http://thrift.apache.org/
  6. http://avro.apache.org/
  7. http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
  8. http://redis.io/
  9. http://couchdb.apache.org/
  10. http://www.mongodb.org/
  11. http://www.basho.com/products_riak_overview.php
  12. http://www.hypergraphdb.org/index
  13. http://neo4j.org/
  14. http://memcached.org/
  15. http://hadoop.apache.org/hdfs/
  16. http://github.com/twitter/gizzard
  17. http://hadoop.apache.org/zookeeper/

[repost ]Brewer’s CAP Theorem


On Friday 4th June 1976, in a small upstairs room away from the main concert auditorium, the Sex Pistols kicked off their first gig at Manchester’sLesser Free Trade Hall. There’s some confusion as to who exactly was there in the audience that night, partly because there was another concert just six weeks later, but mostly because it’s considered to be a gig that changed western music culture forever. So iconic and important has that appearance become that David Nolan wrote a book, I Swear I Was There: The Gig That Changed the World, investigating just whose claim to have been present was justified. Because the 4th of June is generally considered to be the genesis of punk rock6.

Prior to this (in fact since around 1971) there had been a number of protopunk bands, such as the New York Dolls and the Velvet Underground, but it was this one set by the Sex Pistols that in music folklore started the revolution that set in motion the driving guitars of the Buzzcocks, the plaintive wailing of The Smiths, the eclectic syncopations of the The Fall, the rising majesty of Joy Division and Simply Red (I guess you can’t have everything).

Wednesday 19th July 2000, may not go down in popular culture with quite the same magnitude but it’s had a similar impact on internet scale business as the Sex Pistols did on music a quarter of a century earlier, for that was the keynote speech by Eric Brewer at the ACM Symposium on thePrinciples of Distributed Computing (PODC).

The Sex Pistols had shown that barely-constrained fury was more important to their contemporaries than art-school structuralism, giving anyone with three chords and something to say permission to start a band. Eric Brewer, in what became known as Brewer’s Conjecture, said that as applications become more web-based we should stop worrying about data consistency, because if we want high availability in these new distributed applications, then guaranteed consistency of data is something we cannot have, thus giving anyone with three servers and a keen eye for customer experience permission to start an internet scale business. Disciples of Brewer (present that day or later converts) include the likes of AmazonEBay, and Twitter.

Two years later, in 2002, Seth Gilbert and Nancy Lynch of MIT, formally proved Brewer to be correct and thus Brewer’s Theorem was born.

Brewer’s (CAP) Theorem

So what exactly is Brewer’s Theorem, and why does it warrant comparison with a 1976 punk gig in Manchester?

Brewer’s 2000 talk was based on his theoretical work at UC Berkley and observations from running Inktomi, though Brewer and others were talking about trade-off decisions that need to be made in highly scalable systems years before that (e.g. “Cluster-Based Scalable Network Services from SOSP in 1997 and “Harvest, yield, and scalable tolerant systems in 1999) so the contents of the presentation weren’t new and, like many of these ideas, they were the work of many smart people (as I am sure Brewer himself would be quick to point out).

What he said was there are three core systemic requirements that exist in a special relationship when it comes to designing and deploying applications in a distributed environment (he was talking specifically about the web but so many corporate businesses are multi-site/multi-country these days that the effects could equally apply to your data-centre/LAN/WAN arrangement).

The three requirements are: ConsistencyAvailability and Partition Tolerance, giving Brewer’s Theorem its other name – CAP.

To give these some real-world meaning let’s use a simple example: you want to buy a copy of Tolstoy‘s War & Peace to read on a particularly long vacation you’re starting tomorrow. Your favourite web bookstore has one copy left in stock. You do your search, check that it can be delivered before you leave and add it to your basket. You remember that you need a few other things so you browse the site for a bit (have you ever bought just onething online? Gotta maximise the parcel dollar). While you’re reading the customer reviews of a suntan lotion product, someone, somewhere else in the country, arrives at the site, adds a copy to their basket and goes right to the checkout process (they need an urgent fix for a wobbly table with one leg much shorter than the others).

  • Consistency

    A service that is consistent operates fully or not at all. Gilbert and Lynch use the word “atomic” instead of consistent in their proof, which makes more sense technically because, strictly speaking, consistent is the C in ACID as applied to the ideal properties of database transactions and means that data will never be persisted that breaks certain pre-set constraints. But if you consider it a preset constraint of distributed systems that multiple values for the same piece of data are not allowed then I think the leak in the abstraction is plugged (plus, if Brewer had used the word atomic, it would be called the AAP theorem and we’d all be in hospital every time we tried to pronounce it).

    In the book buying example you can add the book to your basket, or fail. Purchase it, or not. You can’t half-add or half-purchase a book. There’s one copy in stock and only one person will get it the next day. If both customers can continue through the order process to the end (i.e. make payment) the lack of consistency between what’s in stock and what’s in the system will cause an issue. Maybe not a huge issue in this case – someone’s either going to be bored on vacation or spilling soup – but scale this up to thousands of inconsistencies and give them a monetary value (e.g. trades on a financial exchange where there’s an inconsistency between what you think you’ve bought or sold and what the exchange record states) and it’s a huge issue.

    We might solve consistency by utilising a database. At the correct moment in the book order process the number of War and Peace books-in-stock is decremented by one. When the other customer reaches this point, the cupboard is bare and the order process will alert them to this without continuing to payment. The first operates fully, the second not at all.

    Databases are great at this because they focus on ACID properties and give us Consistency by also giving us Isolation, so that when Customer One is reducing books-in-stock by one, and simultaneously increasing books-in-basket by one, any intermediate states are isolated from Customer Two, who has to wait a few milliseconds while the data store is made consistent.

  • Availability

    Availability means just that – the service is available (to operate fully or not as above). When you buy the book you want to get a response, not some browser message about the web site being uncommunicative. Gilbert & Lynch in their proof of CAP Theorem make the good point that availability most often deserts you when you need it most – sites tend to go down at busy periods precisely because they are busy. A service that’s available but not being accessed is of no benefit to anyone.

  • Partition Tolerance

    If your application and database runs on one box then (ignoring scale issues and assuming all your code is perfect) your server acts as a kind of atomic processor in that it either works or doesn’t (i.e. if it has crashed it’s not available, but it won’t cause data inconsistency either).

    Once you start to spread data and logic around different nodes then there’s a risk of partitions forming. A partition happens when, say, a network cable gets chopped, and Node A can no longer communicate with Node B. With the kind of distribution capabilities the web provides, temporary partitions are a relatively common occurrence and, as I said earlier, they’re also not that rare inside global corporations with multiple data centres.

    Gilbert & Lynch defined partition tolerance as:

    No set of failures less than total network failure is allowed to cause the system to respond incorrectly

    and noted Brewer’s comment that a one-node partition is equivalent to a server crash, because if nothing can connect to it, it may as well not be there.

The Significance of the Theorem

CAP Theorem comes to life as an application scales. At low transactional volumes, small latencies to allow databases to get consistent has no noticeable affect on either overall performance or the user experience. Any load distribution you do undertake, therefore, is likely to be for systems management reasons.

But as activity increases, these pinch-points in throughput will begin limit growth and create errors. It’s one thing having to wait for a web page to come back with a response and another experience altogether to enter your credit card details to be met with “HTTP 500 java.lang.schrodinger.purchasingerror” and wonder whether you’ve just paid for something you won’t get, not paid at all, or maybe the error is immaterial to this transaction. Who knows? You are unlikely to continue, more likely to shop elsewhere, and very likely to phone your bank.

Either way this is not good for business. Amazon claim that just an extra one tenth of a second on their response times will cost them 1% in sales. Google said they noticed that just a half a second increase in latency caused traffic to drop by a fifth.

I’ve written a little about scalability before, so won’t repeat all that here except to make two points: the first is that whilst addressing the problems of scale might be an architectural concern, the initial discussions are not. They are business decisions. I get very tired of hearing, from techies, that such-and-such an approach is not warranted because current activity volumes don’t justify it. It’s not that they’re wrong; more often than not they’re quite correct, it’s that to limit scale from the outset is to implicitly make revenue decisions – a factor that should be made explicit during business analysis.

The second point is that once you embark on discussions around how to best scale your application the world falls broadly into two ideological camps: the database crowd and the non-database crowd.

The database crowd, unsurprisingly, like database technology and will tend to address scale by talking of things like optimistic locking and sharding, keeping the database at the heart of things.

The non-database crowd will tend to address scale by managing data outside of the database environment (avoiding the relational world) for as long as possible.

I think it’s fair to say that the former group haven’t taken to CAP Theorem with quite the same gusto as the latter (though they are talking about it). This is because if you have to drop one of consistency, availability, or partition tolerance, many opt to drop consistency which is the raison d’être of the database. The logic. no doubt, is that availability and partition-tolerance keep your money-making application alive, whereas inconsistency just feels like one of those things you can work around with clever design.

Like so much else in IT, it’s not as black and white as this. Eric Brewer, on slide 13 of his PODC talk, when comparing ACID and it’s informal counterpartBASE even says “I think it’s a spectrum”. And if you’re interested in this as a topic (it’s slightly outside of what I want to talk about here) you could do worse than start with a paper called “Design and Evaluation of a Continuous Consistency Model for Replicated Services by Haifeng Yu and Amin Vahdat. Nobody should interpret CAP as implying the database is dead.

Where both sides agree though is that the answer to scale is distributed parallelisation not, as was once thought, supercomputer grunt. Eric Brewer’s influence on the Network of Workstations projects of the mid-nineties led to the architectures that exposed CAP theorem, because as he says in another presentation on Inktomi and the Internet Bubble the answer has always been processors working in parallel:

If they’re not working in parallel you have no chance to get the problem done in a reasonable amount of time. This is a lot like anything else. If you have a really big job to do you get lots of people to do it. So if you are building a bridge you have lots of construction workers. That’s parallel processing also. So a lot of this will end up being “how do we mix parallel processing and the internet?”

The Proof in Pictures

Here’s a simplified proof, in pictures because I find it much easier to understand that way. I’ve mostly used the same terms as Gilbert and Lynch so that this ties up with their paper.

The diagram above shows two nodes in a network, N1 and N2. They both share a piece of data V (how many physical copies of War and Peace are in stock), which has a value V0. Running on N1 is an algorithm called A which we can consider to be safe, bug free, predictable and reliable. Running on N2is a similar algorithm called B. In this experiment, A writes new values of V and B reads values of V.

In a sunny-day scenario this is what happens: (1) First A writes a new value of V, which we’ll call V1. (2) Then a message (M) is passed from N1 to N2which updates the copy of V there. (3) Now any read by B of V will return V1.

If the network partitions (that is messages from N1 to N2 are not delivered) then N2 contains an inconsistent value of V when step (3) occurs.

Hopefully that seems fairly obvious. Scale this is up to even a few hundred transactions and it becomes a major issue. If M is an asynchronous message then N1 has no way of knowing whether N2 gets the message. Even with guaranteed delivery of M, N1 has no way of knowing if a message is delayed by a partition event or something failing in N2. Making M synchronous doesn’t help because that treats the write by A on N1 and the update event from N1 to N2 as an atomic operation, which gives us the same latency issues we have already talked about (or worse). Gilbert and Lynch also prove, using a slight variation on this, that even in a partially-synchronous model (with ordered clocks on each node) atomicity cannot be guaranteed.

So what CAP tells us is that if we want A and B to be highly available (i.e. working with minimal latency) and we want our nodes N1 to Nn (where n could be hundreds or even thousands) to remain tolerant of network partitions (lost messages, undeliverable messages, hardware outages, process failures) then sometimes we are going to get cases where some nodes think that V is V0 (one copy of War and Peace in stock) and other nodes will think that V is V1 (no copies of War and Peace in stock).

We’d really like everything to be structured, consistent and harmonious, like the music of a prog rock band from the early seventies, but what we are faced with is a little bit of punk-style anarchy. And actually, whilst it might scare our grandmothers, it’s OK once you know this, because both can work together quite happily.

Let’s quickly analyse this from a transactional perspective.

If we have a transaction (i.e. unit of work based around the persistent data item V) called α, then α1 could be the write operation from before and α2could be the read. On a local system this would be easily be handled by a database with some simple locking, isolating any attempt to read in α2 until α1 completes safely. In the distributed model though, with nodes N1 and N2 to worry about, the intermediate synchronising message has also to complete. Unless we can control when α2 happens, we can never guarantee it will see the same data values α1 writes. All methods to add control (blocking, isolation, centralised management, etc) will impact either partition tolerance or the availability of α1 (A) and/or α2 (B).

Dealing with CAP

You’ve got a few choices when addressing the issues thrown up by CAP. The obvious ones are:

  1. Drop Partition Tolerance

    If you want to run without partitions you have to stop them happening. One way to do this is to put everything (related to that transaction) on one machine, or in one atomically-failing unit like a rack. It’s not 100% guaranteed because you can still have partial failures, but you’re less likely to get partition-like side-effects. There are, of course, significant scaling limits to this.

  2. Drop Availability

    This is the flip side of the drop-partition-tolerance coin. On encountering a partition event, affected services simply wait until data is consistent and therefore remain unavailable during that time. Controlling this could get fairly complex over many nodes, with re-available nodes needing logic to handle coming back online gracefully.

  3. Drop Consistency

    Or, as Werner Vogels puts it, accept that things will become “Eventually Consistent” (updated Dec 2008). Vogels’ article is well worth a read. He goes into a lot more detail on operational specifics than I do here.

    Lots of inconsistencies don’t actually require as much work as you’d think (meaning continuous consistency is probably not something we need anyway). In my book order example if two orders are received for the one book that’s in stock, the second just becomes a back-order. As long as the customer is told of this (and remember this is a rare case) everybody’s probably happy.

  4. The BASE Jump

    The notion of accepting eventual consistency is supported via an architectural approach known as BASE (Basically Available, Soft-state,Eventually consistent). BASE, as its name indicates, is the logical opposite of ACID, though it would be quite wrong to imply that any architecture should (or could) be based wholly on one or the other. This is an important point to remember, given our industry’s habit of “oooh shiny” strategy adoption.

    And here I defer to Professor Brewer himself who emailed me some comments on this article, saying:

    the term “BASE” was first presented in the 1997 SOSP article that you cite. I came up with acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is “ACID” — much more than people realize, so we figured it was good enough. Jim Gray and I discussed these acronyms and he readily admitted that ACID was a stretch too — the A and D have high overlap and the C is ill-defined at best. But the pair connotes the idea of a spectrum, which is one of the points of the PODC lecture as you correctly point out.

    Dan Pritchett of EBay has a nice presentation on BASE.

  5. Design around it

    Guy Pardon, CTO of atomikos wrote an interesting post which he called “A CAP Solution (Proving Brewer Wrong)“, suggesting an architectural approach that would deliver Consistency, Availability and Partition-tolerance, though with some caveats (notably that you don’t get all three guaranteed in the same instant).

    It’s worth a read as Guy eloquently represents an opposing view in this area.


That you can only guarantee two of Consistency, Availability and Partition Tolerance is real and evidenced by the most successful websites on the planet. If it works for them I see no reason why the same trade-offs shouldn’t be considered in everyday design in corporate environments. If the business explicitly doesn’t want to scale then fine, simpler solutions are available, but it’s a conversation worth having. In any case these discussions will be about appropriate designs for specific operations, not the whole shebang. As Brewer said in his email “the only other thing I would add is that different parts of the same service can choose different points in the spectrum”. Sometimes you absolutely need consistency whatever the scaling cost, because the risk of not having it is too great.

These days I’d go so far as to say that Amazon and EBay don’t have a scalability problem. I think they had one and now they have the tools to address it. That’s why they can freely talk about it. Any scaling they do now (given the size they already are) is really more of the same. Once you’ve scaled, your problems shift to those of operational maintenance, monitoring, rolling out software updates etc. – tough to solve, certainly, but nice to have when you’ve got those revenue streams coming in.


  1. HP’s take on CAP Theorem, a white paper entitled “There is no free lunch with distributed data
  2. Computer Science Notes on Distributed Transactions and Network Partitions from University of Sussex
  3. Nice post by Jens Alfke on databases, scaling and Twitter.
  4. Pat Helland‘s Microsoft paper on distributed transactions and SOA called Data on the Outside versus Data on the Inside, which he later related to CAP Theorem here
  5. Another set of Computer Science course slides, this time from George Mason University in Virginia, on Distributed Software Systems and particularly CAP Theorem and the clash between ACID and BASE ideologies.
  6. The 4th June 1976 is considered to be the birth of Punk Rock in the UK. Thanks to Charlie Dellacona for pointing out that the Ramones take credit for starting the movement in early 1974 in the US, though their official punk recordings are more or less contemporaneous.
  7. Thanks to Hiroshi Yuki, a Japanese translation of this article is available.
  8. Thanks to Daniel Cohen, a two part Hebrew translation of this article is available.
  9. Thanks to Wang Qi, a Chinese translation of this article is available.

[repost]Google Megastore – 3 Billion Writes and 20 Billion Read Transactions Daily

A giant step into the fully distributed future has been taken by the Google App Engine team with the release of their High Replication Datastore. The HRD is targeted at mission critical applications that require data replicated to at least three datacenters, full ACID semantics for entity groups, and lower consistency guarantees across entity groups.

This is a major accomplishment. Few organizations can implement a true multi-datacenter datastore. Other than SimpleDB, how many other publicly accessible database services can operate out of multiple datacenters? Now that capability can be had by anyone. But there is a price, literally and otherwise. Because the HRD uses three times the resources as Google App Engine’s Master/Slave datastatore, it will cost three times as much. And because it is a distributed database, with all that implies in the CAP sense, developers will have to be very careful in how they architect their applications because as costs increased, reliability increased, complexity has increased, and performance has decreased. This is why HRD is targeted ay mission critical applications, you gotta want it, otherwise the Master/Slave datastore makes a lot more sense.

The technical details behind the HRD are described in this paper, Megastore: Providing Scalable, Highly Available Storage for Interactive Services. This is a wonderfully written and accessible paper, chocked full of useful and interesting details. James Hamilton wrote an excellent summary of the paper in Google Megastore: The Data Engine Behind GAE. There are also a few useful threads in Google Groups that go into some more details about how it works, costs, and performance (the original announcement, performance comparison).

Some Megastore highlights:

  • Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. It has been used internally by Google for several years, on more than 100 production applications, to handle more than three billion write and 20 billion read transactions daily, and store a petabyte of data across many global datacenters.
  • Megastore is a storage system developed to meet the storage requirements of today’s interactive online services. It is novel in that it blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. It uses synchronous replication to achieve high availability and a consistent view of the data. In brief, it provides fully serializable ACID semantics over distant replicas with low enough latencies to support interactive applications. We accomplish this by taking a middle ground in the RDBMS vs. NoSQL design space: we partition the datastore and replicate each partition separately, providing full ACID semantics within partitions, but only limited consistency guarantees across them. We provide traditional database features, such as secondary indexes, but only those features that can scale within user-tolerable latency limits, and only with the semantics that our partitioning scheme can support. We contend that the data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable, and that a small, but not spartan, set of features can substantially ease the burden of developing cloud applications.
  • Paxos is used to manage synchronous replication between datacenters. This provides the highest level of availability for reads and writes at the cost of higher-latency writes. Typically Paxos is used only for coordination, Megastore also uses it to perform write operations.
  • Supports 3 levels of read consistency: current, snapshot, and inconsistent reads.
  • Entity groups are now a unit of consistency as well as a unit of transactionality. Entity groups seem to be like little separate databases. Each is independently and synchronously replicated over a wide area. The underlying data is stored in a scalable NoSQL datastore in each datacenter.
  • The App Engine Datastore doesn’t transactions across multiple entity groups because it will greatly limit the write throughput when not operating on an entity group, though Megastore does support these operations.
  • Entity groups are an a priori grouping of data for fast operations. Their size and composition must be balanced. Examples of entity groups are: an email account for a user; a blog would have a profile entity group and more groups to hold posts and meta data for each blog. Each application will have find natural ways to draw entity group boundaries. Fi ne-grained entity groups will force expensive cross-group operations. Groups with too much unrelated data will cause unrelated writes to be serialized which degrades throughput. This a process that ironically seems a little like normalizing and will probably prove just as frustrating.
  • Queries that require strongly consistent results must be restricted to a single entity group. Queries across entity groups may return stale results  This is a major change for programmers. The Master/Slave datastore defaulted to strongly consistent results for all queries, because reads and writes were from the master replica by default. With multiple datacenters the world is a lot ore complicated. This is clear from some the Google group comments too. Performance will vary quite a bit where entities are located and how they are grouped.
  • Applications will remain fully available during planned maintenance periods, as well as during most unplanned infrastructure issues. The Master/Slave datastore was subject to periodic maintenance windows. If availability is job one for your application the HRD is a big win.
  • Backups and redundancy are achieved via synchronous replication, snapshots, and incremental log backups.
  • The datastore API does not change at all
  • Writes to a single entity group are strongly consistent.
  • Writes are limited to an estimated 1 per second per entity group, so HRD is not a good match when high usage is expected. This number is not a strict limit, but a good rule of thumb for write performance.
  • With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
  • Only new applications can choose the HRD option. An existing application must be moved to a new application.
  • Performance can be improved at the expense of consistency by setting the read_policy to eventually consistent. This will be bring performance similar to that of Master/Slave datastore. Writes are not affected by this flag,  it only works for read performance, which are already fast in the 99% case, so it doesn’t have a very big impact. Applications performing doing batch gets will notice impressive speed ups from using this flag.
  • One application can’t mix Master/Slave with HRD. The reasoning is HRD can serve out of multiple datacenters and Master/Slave can not, so there’s no way to ensure in failure cases that apps are running in the right place. So if you planned to use an expensive HRD for critical data and the less expensive Master/Slave for less critical data, you can’t do that. You might be thinking to delegate Master/Slave operations to another application, but splitting up applications that way is against the TOS.
  • Once HRD is selected your choice can’t be changed. So if you would like to start with the cheeper Master/Slave for customers who want to pay less and use HRD who would like to pay for a premium service, you can’t do that.
  • There’s no automated migration of Master/Slave data, the HRD data. The application must write that code. The reasoning is the migration will require a read-only period and the application is in the best position to know how to minimize that downtime. There are some tools and code provided to make migration easier.
  • Moving to a caching based architecture will be even more important to hide some of the performance limitations of HRD. Cache can include memcache, cookies, or state put in a URL.
  • Though HRD is based on the Megastore, they do not seem to be the same thing, not every feature of the Megastore may be made available in HRD. To what extent this is true I’m not sure.

Related Articles

Performance can be improved at the expense of consistency by setting the read_policy to eventually consistent. This will be bring performance similar to that of Master/Slave datastore. Writes are not affected by this flag,  it only works for read performance, which are already fast in the 99% case, so it doesn’t have a very big impact. Applications performing doing batch gets will notice impressive speed ups from using this flag.


Google Megastore – 3 Billion Writes and 20 Billion Read Transactions Daily

[repost]VoltDB Decapitates Six SQL Urban Myths and Delivers Internet Scale OLTP in the Process


What do you get when you take a SQL database and start a new implementation from scratch, taking advantage of the latest research and modern hardware? Mike Stonebraker, the sword wielding Johnny Appleseed of the database world, hopes you get something like his new database, VoltDB: a pure SQL, pure ACID, pure OLTP, shared nothing, sharded, scalable, lockless, open source, in-memory DBMS, purpose-built for running hundreds of thousands of transactions a second. VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling.

Will VoltDB kill off the new NoSQL upstarts? Will VoltDB cause a mass extinction of ancient databases? Probably no and no to both questions, but it’s a product with a definite point-of-view and is worth a look as the transaction component in your system. But will it be right for you? Let’s see…

I first heard the details about VoltDB at Gluecon, where Mr. Stonebraker presented his highly entertaining Urban Myths About SQL (slides) talk. The hallways were buzzing long afterwards debating his in-your-face challenge of the existing order. With a refreshing take no prisoners style he lambastes existing SQL products as not good enough, saying they should either get better or die. NoSQL products fair no better as they simply aren’t necessary, their whole existence based on a perception of the weakness of current SQL implementations rather than what SQL could be, if properly implemented.

As an argument against NoSQL, VoltDB simply asks: if you can get a relational database with all the scalable ACIDy goodness, why would you ever stoop to using a NoSQL database that might only ever be eventually reliable?

The attack against existing relational databases systems is to position them as legacy systems that suffer from archaic architectures that are slow by design. They have to deal with disks, threads, and other performance killing constructs that must not be so much evolved as transcended. You see, VoltDB isn’t just competing against NoSQL, it’s aiming squarely at existing relational database vendors by using the patented technological leap play.

It’s a bold two prong strategy, but will the compromises that are part of the reconceptualization of SQL engine architectures prove too limiting for the majority?

VoltDB’s Architecture

The bulk of the rest of this article is about the SQL Myths, but I think touching a bit on Volt’s architecture before we address the myths will help frame the discussion a little better.

John Hugg, from VoltDB Engineering, says:

VoltDB is designed to solve OLTP at internet scale. If you don’t need the scale of VoltDB, then of course you’re going to be much happier with a general system like Postgres that offers so many features and is compatible with so much.

Some other quotes indicating the spirit behind VoltDB:

VoltDB is designed to make difficult or impossible problems manageable.


When we set out to build VoltDB, we figured it wasn’t worth the tradeoffs unless it’s MUCH faster. So it is.

John is not kidding. What matters to VoltDB is: speed at scale, speed at scale, speed at scale, SQL, and ACID. If that matches your priorities then you’ll probably be happy. Otherwise, as you’ll see, everything is sacrificed for speed at scale and what is sacrificed is often ease of use, generality, and error checking. It’s likely we’ll see ease of use improve over time, but for now it looks like rough going, unless of course, you are a going for speed at scale.

Some of the more interesting features are:

  • Main-memory storage. VoltDB stores all its data in RAM. This means there are no buffer pools to manage, so that source of overhead is removed, and there are no blocking disk stalls.
  • Run transactions to completion –single threaded –in timestamp order. Based on the model that 200 record updates is a hefty transaction, you might as well run them to completion. By single threading all locking overhead is removed. On multi-core systems they allocate chunks of memory to each CPU and run each CPU single threaded.
  • Replicas. Persistence is guaranteed by having the data reside in multiple main memories. There are no logs, so there are no disks, which removes the disk overhead. For high availability an active-active architecture is used. Each transaction is run twice, once on each node in the same time-stamp order, which means the replicas are ACID consistent. Data can be asynchronously shipped to disk for a 5% performance hit. VoltDB replication is not a master/slave or primary/backup replication system. Each replica is a first class, fully capable instance of a partition.
  • Tables are partitioned across multiple servers. Partitions are defined in a project XML configuration file that defines the partition keys. Clients can connect through any node. Partitioning is automatic, but you have to decide on the sharding key. They are working on an automated system to give advice on which key to use. If  new nodes are added then the data must reloaded to cause the new nodes to be used, the data is not automatically distributed to the new nodes. Not all tables need to be partitioned. Small, mostly read-only tables can be replicated across all of the partitions of a VoltDB database.
  • Stored procedures, written in Java, are the unit of transaction. Only data changed within a single invocation of a stored procedure is in a transaction, transactions can’t span multiple rounds of communication with a client. Also, from the same source as the previous link,  The vast majority (almost 100%) of your stored procedure invocations must be single partition for VoltDB to be useful to you. If, for example, you partitioned by graph node, updating multiple nodes in a single transaction/procedure will not be a single partition transaction. These and other restrictions show the tradeoff between speed and generality.
  • A limited subset of SQL ’99 is supported. DDL operations like ALTER and DROP aren’t supported. Operations having to do with users, groups and security have been moved into XML configuration files. Updating table structure on the fly is convenient, but it’s not fast, so it’s out. You are also discouraged from doing SUM operations because it would take a long time and block other transactions. Single threading means you must quantize your work into small enough chunks that don’t stall the work pipeline. The goal is to have transactions run in under 50 milliseconds. This is all done for speed.
  • Design a schema and workflow to use single-sited procedures. Data for a table is stored in a partition that is split onto different nodes. Each set of data on a node is called a slice. When a query can run on a single node it it is said to be single-sited. Performance is clearly best when a query can run on just one node against a limited set of data.
  • Challenging operations model. Changing the database schema or reconfiguring the cluster hardware requires first saving and shutting down the database. An exception are stored procedures which can be updated on the fly. In general, choosing speed as the primary design point has made the development and deployment process complicated and limiting. VoltDB, for example, does not support bringing a node back into the cluster while the database is running. All clients must be stopped, the database must be snapshotted, the database must be restarted in a special mode, the data is reloaded, and then clients can be restarted. See Using VoltDB for more details.
  • No WAN support. In the case of a network partition VoltDB chooses consistency over availability, so you will see a hiccup until connectivity can be restored. Out of all the possible failures, Mr. Stonebraker argues, network partitioning is one of the least likely failures, especially compared to programmer error, so choosing strong consistency over availability is the right engineering call. Future versions of VoltDB will do more to address this single-data-center catastrophe scenario.
  • OLAP is purposefully kept separate. VoltDB is only for OLTP. It’s not for reporting or OLAP because those uses require locks be taken which destroys performance. Every update is spooled to a (optional) companion warehouse system that is a second or two behind the main transaction system (yet is perfectly consistent).  The companion system could be something like Vertica, an analytics RDBMS also built by Mr. Stonebraker. The justification for this split of responsibilities is that one size does not fit all. You should run transactions on something that is good at transactions and run reporting on something that’s good at reporting. An specialized transaction architecture will run circles around (50 times to 100 times faster) a one size fits all solution.

VoltDB is different because it has consciously been architected to remove what research shows are the four common sources of overhead in database management systems: logging (19%), latching (19%), locking (17%), B-tree, and buffer management operations (35%). By removing all removing all four sources overhead VoltDB can be really fast while still being ACID. How fast?

VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling. Though I think  linear scaling only applies when you are not using distributed transactions. Two-phase transactions with an in-memory database will be relatively fast, but they will still be slow given the protocol overhead.

Keep in mind VoltDB is in-memory, so it should be fast, that should be a given. But VoltDB says they have gone beyond what other in-memory databases have done, they haven’t just improved buffer management. By removing locks, latching, and threading overhead it’s that much faster than other in-memory databases. You could argue that it’s a waste of RAM, that only hot data should be kept in RAM, but the contention is that RAM can hold the entire data set, so there’s no reason to compromise anymore.

The performance comparison against databases like Cassandra is somewhat of a strawman as they are designed for a much different purpose. Cassandra can store petabytes of data, across hundreds of nodes, across multiple data centers, and new nodes can be added at will. Operationally there’s no comparison. Though I realize the purpose of the benchmarks is to show SQL is not natively slow, can work well for key-value usage patterns, and compares favorably with other industry leaders.

I really like the parallelism of the origin of relational theory with the origin of VoltDB’s architecture. Relational theory was invented to remove update anomalies that occur when storing duplicate data. When data is duplicated, either within a record or in different tables, then it’s easy to cause inconsistencies when performing updates, deletes, and adds. The normalization process makes it much harder to have inconsistencies because facts are stored once and only once. It may be a stretch, but I think of the process of creating a SQL engine architecture based on removing performance anomalies as fascinatingly analogous.

The Six SQL Urban Myths

Here are the six myths that Mr. Stonebraker says NoSQL advocates incorrectly perpetuate:

• Myth #1: SQL is too slow, so use a lower level interface
• Myth #2: I like a K-V interface, so SQL is a non-starter
• Myth #3: SQL systems don’t scale
• Myth #4: There are no open source, scalable SQL engines
• Myth #5: ACID is too slow, so avoid using it
• Myth #6: in CAP, choose AP over CA

Myth #1A: SQL is too slow because of heavy interfaces like ODBC/JDBC


SQL is compiled into an optimized intermediate format in a way similar to how languages like C are compiled into assembly. SQL is not slow. What is slow are heavy interfaces like ODBC/JDBC that cause too many round trips to the database. Performance is determined by this interface.

VoltDB’s Solution:

Only stored procedures are supported. A main advantage stored procedures have over chatty ODBC/JDBC protocols is one message is sent to the database and one reply is sent back. All the computation is in the database. Much more efficient than ODBC/JDBC. To go even faster you can batch stored procedure calls together. Stored procedures are wildly faster than using ODBC/JDBC.

The execution flow is something like:


This is a bit of a strawman argument in that I’ve never heard anyone seriously suggest SQL itself as a language is slow.

Using stored procedures was at one time the canonical relational database architecture. For every operation a stored procedure was created that executed all the required data manipulation and a result was returned. It’s perfectly right to say this is the most efficient path, given certain conditions, but there are many problems:

  1. Stored procedure languages suck. They are difficult to program, ugly to use, and nearly impossible to debug. So programmers escape to the tool chains they know and love, leaving the database to deal with data, not logic. I understand Java is the stored procedure language for VoltDB. Depending on your allegiances that may be the worst or best thing in the world. The more overarching point is that you have no choice, but that may be the price of performance at scale.
  2. Putting logic into the database makes the database an application server, a function for which they are ill equipped. Let’s say during a stored procedure you need to make a REST call to get a discount rate, for example, this involves blocking, IO, threading and all the usual backend server issues that databases don’t know how to do well. VoltDB gets around this issue by simply not allowing you to do this.
  3. Once it can’t scale you are dead, dead, dead. On a few projects I’ve worked on we’ve used the stored procedure approach. It works fine until it doesn’t. At a certain load the database just dies and as a system you are stuck. You can’t make it perform better so you are forced to hack away until all the benefits of using the database are gone and you are left with an expensive and brittle albatross at your system’s core. So that’s why projects have learned not to trust trust the success of their project to how good of an application server your database can be. Instead, they separate out logic from data and let each scale independently. This is a more risk reduced approach. VoltDB counters this objection by allowing new nodes to be added into the system, by limiting the work you can do in the server, by using RAM, and by sharding so you can control how much work is done on each node. The problem is operationally, the process is so onerous.

An interesting fact about stored procedures is that they can take anywhere from half a second to several seconds to prepare a statement in VoltDB 1.0.01, so ad-hoc queries are not practical. VoltDB also does not allow SQL to support getting the schema of tables, listing tables or altering tables. DDL operations aren’t considered part of a core OLTP database.

Myth #1B: SQL is too slow because SQL engine implementations have too much overhead


Traditional relational databases are using 30 year old architectures that make them slow by design. Ninety percent of all CPU cycles in your typical OLTP database go into non-productive work like: managing disk buffer pools, locking, crash recovery, multi-threading. To go faster you need to get rid of the overhead. Traditional relational databases do not do this, they are overhead rich and slow, but this is not the fault of SQL, it’s the implementations that are faulty.

VoltDB’s Solution:

Remove the overhead. VoltDB has made rather radical architectural choices to remove the bottlenecks. The results according to their performance test for single node performance:
• MySQL:  X
• A very popular RDBMS elephant:  2.5 * X
• VoltDB: 100  X

VoltDB is 100 times faster than MySQL in their tests.


Seems spot on to me.

Myth #2: I like a K-V interface, so SQL is a non-starter


Programmers like the convenience that key-value interfaces provide. Put a value by key and get a value by key. Simple. SQL doesn’t provide a key-value interface so programmers won’t use SQL.

VoltDB’s Solution:

Create a thin get/put layer ontop of SQL using stored procedures. They are easy to support on top of a SQL engine. Using gets and puts their tests show that VoltDB is 15 times faster than MySQL and memcached.

  • MySQL/Memcached:  X
  • Cassandra: 7*X
  • VoltDB:  15 * X


Creating a get/put set of stored procedure was at one time standard operating procedure. Any time a table is added so are new stored procedures. What this means is every time any attribute is changed or any table is changed, the effects ripple up and down the entire stack. This is a maintenance nightmare, which reveals one of the major strengths of NoSQL: schemaless design.

I’m sure developers value the ease of putting a value, but what really matters is what that implies, you aren’t tied to a rigid schema that cause nightmares every time a data model changes. Should adding a parameter really cause every interface to break? A simple stored procedure facade completely ignores this aspect of the key-value approach.

Myth #3: SQL systems don’t scale


Some SQL engines don’t support multiple nodes so they don’t scale: MySQL, Postgres, Oracle, SQLServer. Some modern SQL engines do scale linearly: DB2,Vertica, Asterdata, Greenplum, DB2, Vertica, Asterdata, Greenplum, and EnterpriseDB.

VoltDB’s Solution:

Their architecture scales linearly. So far they’ve test on 100 cores on 12 nodes and have scaled linearly.


As discussed previously, you must follow very precise rules about how to partition your system, limit how long queries take, accept a less than agile operations model, and accept that replication equals durability. Speed and scale has its costs, if you are willing to pay VoltDB will probably deliver.

Myth #4: There are no open source, scalable SQL engines

VoltDB is open source. In fact, Mr. Stonebraker believes all future databases will be open source.

Myth #5: ACID is too slow, so avoid using it


Transactions are too expensive which is used by NoSQL vendors as an excuse not to provide real ACID transactions. Some applications require ACID semantics. If the database doesn’t provide this feature then it’s pushed to user code which is the worst of all worlds. Databases live a long time, so if you need ACID later and your database doesn’t support it then you are in trouble.

Mr. Stonebraker contends 99% of all OLTP database are 1TB or less, especially if you factor out static content like pictures. This means transactional database can or will fit in main memory. Facebook is not the common case.

Now combine this with the finding in OLTP through the looking glass, and what we found there that current transactional databases do about 12% useful work, which is the actual cost of doing the retrieves and updates, the rest is spent on logging (19%), latching (19%), locking (17%), B-tree, and buffer management operations (35%). Multi-threading overhead on shared data structures is surprisingly high.

The abstract from the paper:

Online Transaction Processing (OLTP) databases include a suite of features – disk-resident B-trees and heap files, locking-based concurrency control, support for multi-threading – that were optimized for computer technology of the late 1970’s. Advances in modern processors, memories, and networks mean that today’s computers are vastly different from those of 30 years ago, such that many OLTP databases will now fit in main memory, and most OLTP transactions can be processed in milliseconds or less. Yet database architecture has changed little.

Based on this observation, we look at some interesting variants of conventional database systems that one might build that exploit recent hardware trends, and speculate on their performance through a detailed instruction-level breakdown of the major components involved in a transaction processing database system (Shore) running a subset of TPC-C. Rather than simply profiling Shore, we progressively modified it so that after every feature removal or optimization, we had a (faster) working system that fully ran our workload. Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. We also show that there is no single “high pole in the tent” in modern (memory resident) database systems, but that substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.

There’s not one thing you can do to improve performance as the cost is spread around evenly. Better B-Trees, for example, don’t really help. You are only optimizing the 12% part of the overhead which gets you nowhere.

VoltDB’s Solution:

VoltDB removes all four sources of overhead. Getting rid of ACID only gets you to two times faster, if you want to go ten times faster you need to get rid of the buffer pool overhead and the multithreading overhead too. This is a far more radical architecture change.

The result of getting rid of all four sources of overhead is that VoltDB supports ACID transaction semantics while retaining performance. Transactions can be used all the time with no penalty. The obvious implication being: so why would you ever not use a fast, scalable relational database that is faster than anything else you have ever used?


I assume, if it were possible, everyone would like ACID transactions. To make it possible VoltDB makes a number of restrictions that make VoltDB less than ideal as general purpose database. If you have lots of data then VoltDB wont work for you because lots of data still won’t fit in RAM. If you want easy operations then VoltDB won’t work for you. If you want to use your own language then VoltDB won’t work for you. If you want to have longer lived transactions that integrate inputs from a series of related inputs then VoltDB will not work for you. If you want a little OLAP with your OLTP then VoltDB will not work for you. If you want cross data center availability then VoltDB will not work for you.

And VoltDB never said it would do all that. VoltDB promises damn fast and scalable ACID transactions at all costs. But given all the limitations needed to get there, saying NoSQL simply thinks ACID is too expensive, seems mighty unfair.

Myth #6: in CAP, choose AP over CA


CAP theorem says partitions happen so you have to get rid of consistently or availability, if you prefer availability then you have to jettison consistency.

Mr. Stonebraker says there a few things to think about when making this tradeoff:

  • Order matters. Not all transactions are commutative. So if you get rid of ACID and you have operations where orders matters then eventual consistency will give you the wrong result. If you are transactions are a series of additions and subtractions then the order doesn’t matter. But as soon as you throw in a non-commutative operation like multiplication then you will get wrong results. If you decide later that you need ACID type transactions then it’s a fork-lift upgrade.
  • Semantic constraints. If you have something like stock for an item and you don’t want to send out an item when there is non stock left, when there’s a partition in an eventually consistent system you will not be able to do this as each partition has no idea of what the others are doing. So if you have semantic constraints you need ACID.
  • Errors. Errors are what cause partitions. Humans screwing up are the biggest source of errors. Next are bugs in applications or the database itself. Lastly, network failures cause partitions. Network failures are rare so designing an eventually consistent system for the rarest failure case seems like a bad engineering system. It makes more sense to take the hit for the rare case of a network being down, especially given the number of failures you’ll have because of other non-network related problems is so much higher. Sacrificing consistency for availability is a bad engineering tradeoff.

VoltDB’s Solution:

On network failures VoltDB proudly chooses consistency over availability. Your database is down until the network partition and the database are repaired.


I agree completely that moving the repair logic to the programmer is a recipe for disaster. Having programmers worry about read repair, vector clocks, the commutativity of transactions, how to design compensatory transactions to make up for previous failed transactions, and the other very careful bits of design, is asking for a very fragile system. ACID transactions are clean and understandable and that’s why people like them.

Do you buy the argument that network partitions are rare? To the extent you buy this argument determines how you feel about VoltDB’s design choices. Cloud architectures now assume failure as a starting point. They generally assume nodes within a data center are unreliable and that the more nodes you have and the more data center boundaries you cross, the more reason there is to expect failure. Amazon, for example, made the decision to go with an eventually consistent architecture by using Dynamo to back their shopping carts. Would they have done that for no reason? With that philosophy in mind, Cassandra has embraced failure at its core. Nodes can be added or taken down at any time, all because failure is assumed from the start. This makes for a very robust system. The price is a lack of distributed ACID transactions.

In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location. In return though you get performance and ACID guarantees. It all depends on where you want to place your partition failure bets.

General Discussion

Our Future is Polyglot

VoltDB is a racehorse. It’s a very specialized component that has been bred to do one thing and one thing only: be a very fast and scalable OLTP database. If that’s what you need then VoltDB could be your Triple Crown winner.

What VoltDB reinforces for me is that our future is polyglot. Specialized tools for specialized purposes. VoltDB is specialized for OLTP. It’s a singletasker that doesn’t want to be mediocre at everything, it wants to be great at one thing. A pattern we are seeing more and more. For graph access use a graph database. For search use a search subsystem. And so on.

We Need Cache Consistency Protocols

What we may need is a robust cache coherence protocol between different data silos. In this kind of architecture there’s no real master database other databases can sync to. Every silo has their equally valid perspective on the data. As events stream in to each silo we need some way of keeping all the silos consistent. I have not seen this type of component yet.

Open Source Means We Can Build Complex Interacting Systems

What makes the polyglot future possible is open source. With open source it’s possible to make a system of many complex parts because open source pricing means it doesn’t cost anything until you want extra support. Spending a good percentage of your budget on a mega database means it simply has to do everything because you can’t afford anything else. With open source we can create very powerful composite systems that wouldn’t otherwise be possible to create with lower end budgets.

Related Articles

• Myth #1: SQL is too slow, so use a lower level interface • Myth #2: I like a K-V interface, so SQL is a non-starter • Myth #3: SQL systems don’t scale • Myth #4: There are no open source, scalable SQL engines • Myth #5: ACID is too slow, so avoid using it • Myth #6: in CAP, choose AP over CA