Some users & customers have asked about the most recent release of Apache Hadoop, v1.0: what’s in it, what it followed and what it preceded. To explain this we should start with some basics of how Apache projects release software:
By and large, in Apache projects new features are developed on a main codeline known as “trunk.” Occasionally very large features are developed on their own branches with the expectation that they will later merge into trunk. Because new features usually land in trunk before they reach a release, trunk itself carries little expectation of quality or stability. Periodically, candidate releases are branched from trunk. Once a candidate release is branched, it usually stops getting new features. Bugs are fixed and, after a vote, a release is declared for that particular branch. Any member of the community can create a branch for a release and name it whatever they like.
This diagram illustrates the history of the various Apache Hadoop releases and their origins. There are three occasions where community releases from the Apache Hadoop project broke with what would be a more traditional release and branch convention. These occasions are usually the source of confusion for users.
- More than a year after Apache Hadoop 0.20 branched, significant feature development continued only on that branch and not on trunk. Two major features were added to branches off 0.20.2. One was authentication, enabling strong security for core Hadoop. The other was append, enabling users to run Apache HBase without risk of data loss. The security branch was later released as 0.20.203. These branches and their subsequent releases have been the largest source of confusion for users because, since that time, releases off the 0.20 branches have had features that releases off trunk did not have, and vice versa.
- Apache Hadoop 0.22 was released chronologically after Apache Hadoop 0.23. Apache Hadoop 0.23 is a strict superset of the features in 0.22, yet it was released a month before 0.22.
- A few weeks after 0.23 released, the 0.20 branch formerly known as 0.20.205 was renumbered 1.0. There is next to no functional difference between 0.20.205 and 1.0. This is just a renumbering.
Because of the first issue above, for 18 months no single Apache release had all of the committed features of Apache Hadoop. This table illustrates the point:
As members of the Apache Hadoop community, Cloudera engineers have focused their efforts on getting back to releases that are strict supersets of all of the features of any past release, so that users no longer have to make the unpleasant choice of picking one feature set over another. The good news is that, apart from the confusion over the 1.0 numbering, we are basically there. There have been two good recent releases off trunk (0.22 and 0.23), one of which (0.23) does have all of the features of any past release. It’s very possible these new releases will be renumbered to 2.0 or 3.0 or some other number to indicate they are functional supersets of 1.0, but this remains to be decided.
Many of you are CDH users and by now you’re wondering what Apache Hadoop you are running today and what Apache Hadoop you’ll be running in the future. This diagram shows the CDH releases and the Apache Hadoop releases they draw from.
The CDH1 distribution incorporated the 0.18.3 Apache Hadoop release. The CDH2 distribution incorporated the 0.20.1 Apache Hadoop release. The CDH3 distribution incorporated the 0.20.2 Apache Hadoop release plus the features of the 0.20.append and 0.20.security branches that collectively are now known as “1.0.” The Apache Hadoop in CDH3 has been the equivalent of the recently announced Apache Hadoop 1.0 for approximately a year now. The CDH4 distribution will likely incorporate a release from the 0.23.x series. We also do quarterly updates for CDH releases. These updates typically include backports from trunk that fix bugs or improve performance and stability, not new component releases. In some cases, when doing so is neither destabilizing nor compatibility-breaking, a CDH update will include an incrementally newer component version. For example, CDH3U0 uses HBase 0.90.0 whereas CDH3U2 uses HBase 0.90.4.
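For readers who want the CDH-to-Apache-Hadoop correspondence in a form they can query, here is a minimal sketch in Python. The version numbers are transcribed from the paragraph above; the dictionary layout and the `describe` helper are illustrative assumptions and are not part of any CDH or Apache Hadoop tooling. CDH4 is deliberately left out because its base release has not been decided.

```python
# Apache Hadoop base for each CDH release, as stated in the text above.
# The structure and names here are hypothetical, chosen only for illustration.
CDH_HADOOP_BASE = {
    "CDH1": {"hadoop_base": "0.18.3", "extra_branches": []},
    "CDH2": {"hadoop_base": "0.20.1", "extra_branches": []},
    "CDH3": {
        "hadoop_base": "0.20.2",
        # Features from these two branches are what is now called "1.0".
        "extra_branches": ["0.20.append", "0.20.security"],
    },
}


def describe(cdh_release: str) -> str:
    """Return a one-line summary of the Apache Hadoop code in a CDH release."""
    entry = CDH_HADOOP_BASE.get(cdh_release)
    if entry is None:
        # Covers releases whose base is not yet fixed, such as CDH4.
        return f"{cdh_release}: no Apache Hadoop base recorded"
    text = f"{cdh_release}: Apache Hadoop {entry['hadoop_base']}"
    if entry["extra_branches"]:
        text += " plus " + ", ".join(entry["extra_branches"])
    return text


if __name__ == "__main__":
    for release in ("CDH1", "CDH2", "CDH3", "CDH4"):
        print(describe(release))
```

Keeping the extra branches separate from the base version mirrors the point made above: CDH3 is not a single upstream release but 0.20.2 with two feature branches added.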
Cloudera’s Distribution including Apache Hadoop currently incorporates and integrates 13 different open source components to create a single open source, Apache Hadoop-based data management platform. Of the 13 components, 11 come from Apache projects, Apache Hadoop being one of them. All of these projects have their own branch and release quirks because each project is a different collection of individuals with different motivations and preferences. This is a feature, not a bug, of the Apache community process. By creating an environment where individuals with disparate motivations can all contribute, projects attract more contributors and more innovation.
CDH has a multi-year history of annual releases, quarterly updates, clear upgrade paths and strong policies around maintaining compatibility and stability across updates. This has only been possible because the CDH engineering team includes more than 20 engineers who are committers and PMC members of the various Apache projects and who can shape the innovation of the extended community into a single coherent system. This is why we believe demonstrated leadership in open source contribution is the only way to harness the open innovation of the Apache Hadoop ecosystem.
The most current GA release of CDH is CDH3, update 2. Find out more about it here.