AdaBoost [Freund-Schapire, 1997] is one of the best off-the-shelf supervised classification methods developed in the last fifteen years. Despite (or perhaps because of) its simplicity and versatility, it is surprisingly under-represented among open-source software packages. The goal of this submission is to fill this gap.
Our implementation is based on the AdaBoost.MH algorithm [Schapire-Singer, 1999]. It is an intrinsically multi-class classification method (unlike SVM, for example), and it was easy to extend to multi-label or multi-task classification (when one item can belong to several classes). The program package can be divided into four modules that can be changed more or less independently, depending on the application.
The strong learner. It tells you how to boost. The main boosting engine is AdaBoost.MH, but we have also implemented FilterBoost for a research project, as well as Arc-GV, which is straightforward to add once the main engine is in place. Other possible strong learners could be LogitBoost and ADTrees.
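To illustrate how a strong learner drives the boosting loop, here is a minimal sketch of AdaBoost.MH with decision stumps as weak learners. This is not the MultiBoost code: the function names are ours, and the stump search and weighting scheme follow the standard Schapire-Singer formulation in a simplified form.

```python
# Minimal illustrative sketch of AdaBoost.MH with decision stumps
# (not the actual MultiBoost implementation).
import numpy as np

def best_stump(X, Y, D):
    """Pick (feature, threshold, per-class votes) maximizing the edge
    r = sum_{i,l} D[i,l] * Y[i,l] * v[l] * phi(x_i)."""
    best = (-1.0, None)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            phi = np.where(X[:, j] > theta, 1.0, -1.0)   # +-1-valued stump
            edges = (D * Y * phi[:, None]).sum(axis=0)   # per-class edges
            r = np.abs(edges).sum()                      # with votes v = sign(edges)
            if r > best[0]:
                best = (r, (j, theta, np.sign(edges)))
    return best

def adaboost_mh(X, Y, T=50):
    """Y is an n x K label matrix with entries in {+1, -1}."""
    n, K = Y.shape
    D = np.full((n, K), 1.0 / (n * K))                   # weights over (instance, label) pairs
    model = []
    for _ in range(T):
        r, (j, theta, v) = best_stump(X, Y, D)
        r = min(r, 1 - 1e-12)                            # numerical guard
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        phi = np.where(X[:, j] > theta, 1.0, -1.0)
        D *= np.exp(-alpha * Y * phi[:, None] * v[None, :])
        D /= D.sum()                                     # renormalize
        model.append((alpha, j, theta, v))
    return model

def predict(model, X):
    F = sum(a * np.where(X[:, j] > th, 1.0, -1.0)[:, None] * v[None, :]
            for a, j, th, v in model)
    return F.argmax(axis=1)                              # predicted class index
```

Note how the multi-class case falls out naturally: the weights live on (instance, label) pairs, so multi-label data only changes the label matrix Y, not the loop.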
The base (or weak) learner. It tells you what features to boost. Right now we have two basic (feature-wise) base learners: decision stumps for real-valued features and indicators for nominal features. We have two meta base learners: trees and products. They can use any base learner to construct a generic complex base learner, either with a "classic" tree structure (decision trees) or as a product of simple base learners (self-advertisement: boosting products of stumps is the best reported no-domain-knowledge algorithm on MNIST after Hinton and Salakhutdinov's deep belief nets). We have also implemented Haar filters [Viola-Jones, 2004] for image classification: a meta base learner that uses stumps over a high-dimensional feature space computed "on the fly". It is a nice example of a domain-dependent base learner that works hand in hand with its appropriate data structure.
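To see why the product meta base learner adds expressive power, here is a toy sketch (ours, not MultiBoost's C++ implementation): a product of two ±1-valued stumps represents an XOR-like pattern that no single axis-aligned stump can.

```python
# Illustrative sketch of a "product" meta base learner:
# the weak hypothesis is the product of several +-1-valued stumps.
import numpy as np

def stump(j, theta):
    """A +-1-valued decision stump on feature j with threshold theta."""
    return lambda X: np.where(X[:, j] > theta, 1.0, -1.0)

def product_learner(stumps):
    """Combine base learners multiplicatively: h(x) = prod_m h_m(x)."""
    def h(X):
        out = np.ones(X.shape[0])
        for s in stumps:
            out *= s(X)
        return out
    return h

# XOR-like pattern: no single axis-aligned stump separates it,
# but the product of two stumps does.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
h = product_learner([stump(0, 0.5), stump(1, 0.5)])
print(h(X))  # [ 1. -1. -1.  1.]
```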
The data representation. The basic data structure is a matrix of observations with a vector of labels. We also support multi-label classification, in which the label data is a full matrix. In addition, we have sparse data representations for both the observation matrix and the label matrix. In general, base learners are implemented to work with their own data representation (for example, sparse stumps work on sparse observation matrices, and Haar filters work on an integral-image data representation).
The data parser. We can read data in arff, svmlight, and csv formats.
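For readers unfamiliar with the svmlight format, here is a minimal reader sketch, assuming one real-valued label per line followed by sparse `index:value` pairs; MultiBoost's actual parser handles more variants (multi-label lines, arff, csv).

```python
# Minimal sketch of an svmlight-format reader (illustrative only).
# Each line looks like: "<label> <index>:<value> <index>:<value> ... # comment"
def read_svmlight(lines, n_features):
    X, y = [], []
    for line in lines:
        line = line.split('#')[0].strip()      # strip trailing comments
        if not line:
            continue
        fields = line.split()
        y.append(float(fields[0]))
        row = [0.0] * n_features               # expand sparse pairs to a dense row
        for item in fields[1:]:
            idx, val = item.split(':')
            row[int(idx) - 1] = float(val)     # svmlight indices are 1-based
        X.append(row)
    return X, y

X, y = read_svmlight(["+1 1:0.5 3:2.0", "-1 2:1.0  # a comment"], 3)
print(X)  # [[0.5, 0.0, 2.0], [0.0, 1.0, 0.0]]
print(y)  # [1.0, -1.0]
```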
The base learner/data structure combinations cover a large spectrum of possible applications, but the main advantage of the package is that it is easy (for the advanced user) to adapt MultiBoost to a specific (non-standard) application by implementing the base learner and data structure interfaces that work together.
The source code is available from the website multiboost.org. It can be compiled on Mac OS X, Linux, and Microsoft Windows. The interface is command-line execution with switches.
- Changes from the previous version:
- A new fast (sublinear in the number of instances) stump algorithm has been implemented. The speedup is proportional to the sparsity of the features (it is significant when many instances take the most frequent feature value). See Section B.2 in the documentation.
- A parametrized early-stopping option has been added in --traintest mode. We stop if the (smoothed) test error does not improve for a given number of iterations. See Section 4.1.3 in the documentation.
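The early-stopping rule can be sketched as follows; the moving-average window and the exact smoothing scheme here are illustrative assumptions, not necessarily what MultiBoost implements.

```python
# Sketch of early stopping on a smoothed test error: stop when the
# smoothed error has not improved for `patience` iterations.
# The moving-average smoothing is an illustrative assumption.
from collections import deque

def early_stopping(errors, patience, window=5):
    """Return the 1-based iteration at which training would stop,
    or len(errors) if the rule never triggers."""
    buf = deque(maxlen=window)
    best, best_iter = float('inf'), 0
    for t, e in enumerate(errors, start=1):
        buf.append(e)
        smoothed = sum(buf) / len(buf)        # moving-average smoothing
        if smoothed < best:
            best, best_iter = smoothed, t
        elif t - best_iter >= patience:       # no improvement for `patience` rounds
            return t
    return len(errors)
```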
- JMLR MLOSS PaperURL: JMLR-MLOSS Paper Homepage
- Supported Operating Systems: Linux, Mac OS X, Windows
- Data Formats: svmlight, arff, csv
- Tags: Large Scale Learning, Adaboost, Boosting, Multilabel Classification, Arff, Multiclass Classification, Icml2010
Other available revisions
Version 1.2.02 (March 31, 2014)
Major changes:
- The "early stopping" feature can now be based on any metric output with the --outputinfo command line argument.
- Early stopping now works with the --slowresume command line argument.
- More informative output when testing.
- Various compilation glitches fixed with recent clang (OS X/Linux).

Version 1.2.01 (June 12, 2013)
- Bug in the csv parser corrected.

Version 1.2.00 (April 22, 2013)

Version 1.1.08 (February 6, 2013)
- Streamlined output info file and consistent slow- and fast-resume behaviors.
- Library compilation and precompiled executables for various platforms.
- Sparse input format for both observations and labels, matching WEKA's sparse arff specifications.
- noloop switch added for ProductLearner.

Version 1.1.06 (November 5, 2012)
- Breiman's Arc-GV added.

Version 1.1.05 (September 23, 2011)

Version 1.1.02 (August 15, 2011)
- Minor changes in the file headers.

Version 1.1.00 (July 28, 2011)
- Viola-Jones cascade and soft cascade strong learners are implemented.
- Output format is made more flexible.
- New improved Hamming tree base learner is implemented.
- Data filtering is streamlined.

Version 1.0.01 (September 21, 2010)
- Small bug in trees of Haar stumps corrected.

Version 1.0 (April 9, 2010)
- Initial announcement on mloss.org.