[repost] Apache Mahout 0.7 release notes

Sub-task


  • [MAHOUT-981] – Refactor KMeans Clustering into a separate post process with outlier pruning
  • [MAHOUT-982] – Refactor Canopy Clustering into a separate post process with outlier pruning
  • [MAHOUT-983] – Refactor Dirichlet Clustering into a separate post process with outlier pruning
  • [MAHOUT-984] – Refactor Fuzzy K Means Clustering into a separate post process with outlier pruning
  • [MAHOUT-988] – Convert K-means buildClusters to use new ClusterIterator
  • [MAHOUT-989] – Convert fuzzy-K-means buildClusters to use new ClusterIterator
  • [MAHOUT-990] – Convert Dirichlet buildClusters to use new ClusterIterator
  • [MAHOUT-991] – Convert Canopy, MeanShift, K-means, Dirichlet, Fuzzy KMeans and Other Tools to emit ClusterWritable
  • [MAHOUT-1014] – Recreate Newsgroups example using naivebayes package

Bug

  • [MAHOUT-399] – LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.
  • [MAHOUT-784] – Exception at 20 Newsgroups examples
  • [MAHOUT-826] – Bayes/CBayes classification on a non-existing feature
  • [MAHOUT-832] – clusterdump job: bug and usability problems
  • [MAHOUT-834] – rowsimilarityjob doesn’t clean its temp dir, and fails when seeing it again
  • [MAHOUT-911] – Naive Bayes trains models that are too large to apply
  • [MAHOUT-915] – OutOfMemoryError in EigenVerificationJob
  • [MAHOUT-939] – ASF Email Classification Examples don’t always produce good results
  • [MAHOUT-946] – Map-reduce job status often left unchecked
  • [MAHOUT-951] – StackOverflow Error when using mahout lucene.vector
  • [MAHOUT-955] – Bayes classification results are unstable after classifying non-existing features
  • [MAHOUT-967] – SequenceFileFromMailArchive missing from driver.classes.props
  • [MAHOUT-971] – kmeans does not work in S3
  • [MAHOUT-973] – SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)
  • [MAHOUT-994] – mahout script shouldn’t rely on HADOOP_HOME since that was deprecated in all major Hadoop branches
  • [MAHOUT-999] – KMeans failing to create correct Clustering Policy
  • [MAHOUT-1005] – Nuke Colt math functions that remain untested – The Matrix edition
  • [MAHOUT-1006] – Example from book no longer works – prepare20newsgroups broken with Lucene upgrade
  • [MAHOUT-1011] – RecommenderJob is ignoring the command line threshold parameter
  • [MAHOUT-1015] – Precondition check in IRStatisticsImpl broken
  • [MAHOUT-1016] – running clusterControlDataWithMeanShift against Hadoop 2.0.0 RC resulted in ClassCastException
  • [MAHOUT-1017] – clusterControlDataWithCanopy, clusterControlDataWithFuzzyKMeans, clusterControlDataWithDirichlet examples are looking for output in the wrong place
  • [MAHOUT-1023] – TestFuzzyKmeans is throwing NPE
  • [MAHOUT-1024] – cluster_reuters.sh still relies on old (now removed) lda implementation
  • [MAHOUT-1028] – seq2sparse n-gram weighting creates malformed vectors which crashes kmeans
  • [MAHOUT-1047] – CVB hangs after completion

Improvement

  • [MAHOUT-768] – Duplicated DoubleFunction in mahout and mahout-collections (mahout.math package).
  • [MAHOUT-782] – Build error with Java JDK 1.7
  • [MAHOUT-822] – Mahout needs to be made compatible with Hadoop 0.23 releases
  • [MAHOUT-845] – Make cluster top terms code more reusable
  • [MAHOUT-848] – M/R job launching code should add Oozie’s action.xml as a configuration resource of the Hadoop Configuration object
  • [MAHOUT-929] – Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
  • [MAHOUT-930] – Refactor Vector Classification out of Clustering – Make Classification abstract
  • [MAHOUT-931] – Implement a pluggable outlier removal capability for cluster classifiers
  • [MAHOUT-933] – Implement mapreduce version of ClusterIterator
  • [MAHOUT-947] – Improvements to seqdumper
  • [MAHOUT-948] – Improved error reporting when ARFF index does not exist in arff.vector [fix provided]
  • [MAHOUT-963] – GenericUserPreferenceArray and GenericItemPreferenceArray use selection sorts
  • [MAHOUT-965] – Added possibility to configure map collection name of MongoDBDataModel
  • [MAHOUT-970] – Make hadoop version overridable
  • [MAHOUT-977] – Thread-safe version of PlusAnonymousUserDataModel with multiple concurrent users
  • [MAHOUT-979] – RowSimilarityJob should be able to infer the number of columns from the input matrix if not specified
  • [MAHOUT-980] – Patch to make PFPGrowth run on Amazon MapReduce (also shows possible pattern to make other algorithms work in Amazon MapReduce)
  • [MAHOUT-986] – OutOfMemoryError in LanczosState by way of SpectralKMeans
  • [MAHOUT-987] – Our build is unstable – this should reduce our style warnings by >200
  • [MAHOUT-1001] – Performance improvement in recommenditembased
  • [MAHOUT-1009] – Remove old LDA implementation from codebase
  • [MAHOUT-1013] – The tests for org.apache.mahout.math.stats.entropy should not write to the project home but to a temp directory
  • [MAHOUT-1027] – Change to latest lucene version
  • [MAHOUT-1050] – mutable in-memory datamodel

New Feature

  • [MAHOUT-737] – Implicit Alternating Least Squares SVD
  • [MAHOUT-817] – Add PCA options to SSVD code

Task

  • [MAHOUT-1008] – Remove link analysis package
  • [MAHOUT-1010] – Remove the old naive bayes implementation (org.apache.mahout.classifier.bayes) from the codebase
  • [MAHOUT-1012] – Remove watchmaker from codebase