One of the key points I tried to make in my Red Hat Summit talk about GlusterFS last month is that GlusterFS quite deliberately does not trade away data safety or consistency for performance. That’s a painful choice, because everyone always wants to be the speed king and they’ll be sharply critical of anyone they feel is not running as fast as they can. However, one thing I’ve learned about the storage marketplace is that the recognition of priorities other than speed is what separates the pros you can trust from the amateurs you can’t. Harsh, perhaps, but true. Sure, speed matters, but so do robustness and ease of use and cost and features and community and compatibility and a whole bunch of other things. This is especially true when the system is designed to be scalable so that you can address performance issues by adding hardware at linear rather than exponential cost per increment. If you can buy more performance but you can’t buy more of those other things, you’d be a fool to buy the system that’s built for speed and speed alone.
This issue came up again recently when – for what seems like the thousandth time – someone commented on GlusterFS’s poor small-write performance. Well, yeah, because when we say a write is done it’s done. It’s on as many remote servers as you asked for, not just buffered locally like our highest-profile competitor. That’s an example of refusing to sacrifice data safety for the sake of better performance. Similarly, when you list a directory we actually check whether new files have appeared or old ones have been deleted, instead of just returning cached and possibly stale information, so directory listings and general “many small file” workloads tend to perform poorly on GlusterFS compared to systems that take those shortcuts. That’s an example of refusing to sacrifice consistency/correctness for the sake of performance. Sure, we could buffer writes more and cache reads more, and most users would probably not even notice except for the improved performance, but some users would experience failures and even data loss because their expectations (however unrealistic those might be) were not met. Safety and correctness are the defaults, and I shouldn’t even need to defend that position. Where we haven’t done as well is in allowing those defaults to be changed.
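To make the write-path difference concrete, here is a minimal Python sketch. It is not GlusterFS code or anyone else's: `Replica`, `SyncWriter`, and `BufferedWriter` are hypothetical names invented for illustration, and the "servers" are just in-memory lists. The only point is where each writer says "done".

```python
class Replica:
    """Hypothetical stand-in for a remote storage server."""

    def __init__(self):
        self.blocks = []

    def store(self, data: bytes) -> None:
        # In a real system this would be a network round trip.
        self.blocks.append(data)


class SyncWriter:
    """Acknowledges a write only after every replica has stored it."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, data: bytes) -> bool:
        for replica in self.replicas:
            replica.store(data)
        return True  # "done" means the data is on all replicas


class BufferedWriter:
    """Acknowledges immediately and pushes to replicas only on flush()."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []

    def write(self, data: bytes) -> bool:
        self.pending.append(data)
        return True  # "done" only means buffered locally

    def flush(self) -> None:
        for data in self.pending:
            for replica in self.replicas:
                replica.store(data)
        self.pending.clear()
```

With `SyncWriter`, every small write pays for a trip to each replica before it returns. With `BufferedWriter`, anything still in `pending` is lost if the client dies before `flush()`, even though the caller was already told the write succeeded.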
The fundamental problem here is a three-way tradeoff: performance vs. consistency vs. simplicity. (If this seems a lot like CAP, or even more like PACELC, that should come as no surprise. Probably because messages are so expensive and you can only do so much with each one, “triangles” like this seem particularly common when dealing with distributed systems.) In this case, performance and consistency are pretty self-explanatory. Simplicity is a bit harder. Far from being a mere matter of aesthetics, simplicity is also a matter of the effort required to make progress in other directions. When a system becomes too complex, it becomes incredibly hard to deal with all of the cases that arise during fault recovery, let alone those that result in slowdowns without a fault. New features are harder to add, new developers are harder to attract, monitoring and ease of use suffer, and so on. In distributed systems, simplicity is not a virtue – it’s a necessity. Thus, if you choose to give up simplicity, a lot of people might not care about your performance and consistency because your system either won’t do what they need or can’t be trusted to keep on doing it. That said, let’s look at what choices various systems have made:
- Sacrifice performance. Obviously I’d put GlusterFS in this category. Some highly available NFS implementations have made the same choice, as have relational databases in general.
- Sacrifice consistency. This is exactly what all of the NoSQL data stores are known for, along with HDFS and “blob stores” like Amazon’s S3 or OpenStack’s Swift. Non-HA implementations of NFS also tend this way, though it could well be argued that they give up very little consistency to very good effect.
- Sacrifice simplicity. AFS (and its descendants) pretty much went this route, which is why most people have barely heard of them. Some might put Lustre and/or NFSv4.x in this category as well.
The point is not to say one set of tradeoffs is better than the others, even though – as with CAP – I feel that one choice leads to a distinctly less useful system than the other two. What’s more important is to realize that a tradeoff is being made, and to understand its nature. I for one would like to see GlusterFS offer more of an option to sacrifice consistency when and how the user chooses to do so, even if that’s not the default behavior. That’s why I’ve worked on asynchronous replication, replication bypass, negative-lookup caching, and a whole bunch of other things that all weaken consistency in a controlled (and modular) way. I spent years trying to implement distributed invalidation-based consistency, and I was one of the most stubborn holdouts for that approach, but even I’ve come to believe that it’s firmly in the “too complex” category, especially when fault handling is considered. Nonetheless, I still hope to add some sort of TTL-based on-disk client caching some day. The performance “sweet spot” for GlusterFS will continue to grow, including more and more workloads and use cases, but never by trading away characteristics that are even more important.
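As a rough illustration of what "weakening consistency in a controlled way" can look like, here is a minimal Python sketch of negative-lookup caching with a TTL. It is an assumption-laden toy, not the GlusterFS implementation: `NegativeLookupCache`, `backend_lookup`, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time


class NegativeLookupCache:
    """Remembers "file not found" answers for a bounded time (hypothetical sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the behavior can be tested
        self._missing = {}      # path -> time at which the cached miss expires

    def lookup(self, path, backend_lookup):
        now = self.clock()
        expiry = self._missing.get(path)
        if expiry is not None and now < expiry:
            # Cached miss: skip the server round trip. This answer may be
            # stale for up to ttl seconds if the file was created meanwhile.
            return None
        result = backend_lookup(path)
        if result is None:
            self._missing[path] = now + self.ttl
        else:
            self._missing.pop(path, None)
        return result
```

The tradeoff is explicit and bounded: the user picks `ttl_seconds`, and that number is exactly how long a newly created file might remain invisible. Setting it to zero restores the always-check default.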