BOOMER: Gradient Boosted Multi-Label Classification Rules

This software package provides the official implementation of BOOMER, an algorithm that uses gradient boosting to learn an ensemble of multi-output rules built with respect to a specific multivariate loss function. It integrates with the popular scikit-learn machine learning framework.

The problem domains addressed by this algorithm include the following:

  • Multi-label classification: The goal of multi-label classification is the automatic assignment of sets of labels to individual data points, for example, the annotation of text documents with topics.

  • Multi-output regression: Multivariate regression problems require predicting more than a single numerical output variable.
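To make the multi-label setting concrete, the following minimal sketch encodes sets of labels as a binary indicator matrix with one row per data point and one column per label. The topics, the toy assignments, and the helper `to_indicator_matrix` are made up for illustration and are not part of this package.

```python
# Hypothetical toy data: three documents and three candidate topics.
topics = ["politics", "sports", "economy"]
assigned = [{"politics", "economy"}, {"sports"}, set()]


def to_indicator_matrix(label_sets, all_labels):
    """Encode sets of labels as rows of a binary indicator matrix."""
    return [[1 if label in labels else 0 for label in all_labels]
            for labels in label_sets]


# One row per document, one column per topic.
Y = to_indicator_matrix(assigned, topics)
```

In multi-output regression, the same layout applies, except that each cell holds a numerical target instead of a 0/1 indicator.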

The BOOMER Algorithm

To provide a versatile tool for different use cases, great emphasis is put on the efficiency of the implementation. Moreover, to ensure its flexibility, it is designed in a modular fashion and can therefore easily be adjusted to different requirements. This modular approach enables implementing different kinds of rule learning algorithms (see packages mlrl-common and mlrl-seco).

📖 References

The algorithm was first published in the following paper. A preprint version is publicly available here.

Michael Rapp, Eneldo Loza Mencía, Johannes Fürnkranz, Vu-Linh Nguyen, and Eyke Hüllermeier. Learning Gradient Boosted Multi-label Classification Rules. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2020, Springer.

If you use the algorithm in a scientific publication, we would appreciate citations to the mentioned paper.

🔧 Functionalities

The algorithm that is provided by this project currently supports the following core functionalities for learning ensembles of boosted classification or regression rules.

Deliberate Loss Optimization

  • Decomposable or non-decomposable loss functions can be optimized in expectation.

  • L1 and L2 regularization can be used.

  • Shrinkage (a.k.a. the learning rate) can be adjusted for controlling the impact of individual rules on the overall ensemble.
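How regularization and shrinkage interact can be sketched with the closed-form update that is common in second-order gradient boosting. The sketch below handles a single output under a decomposable loss. The function name `regularized_score` and its parameters are hypothetical, and the formula is the textbook Newton step with soft-thresholding, which may differ in detail from what this package computes.

```python
def regularized_score(gradients, hessians, l1=0.0, l2=0.0, shrinkage=1.0):
    """Score predicted by a rule for one output (textbook Newton step).

    `gradients` and `hessians` hold the first and second derivatives of the
    loss for the examples covered by the rule. All names are illustrative.
    """
    g = sum(gradients)
    h = sum(hessians)
    # L1 regularization: soft-threshold the summed gradient towards zero.
    if g > l1:
        g -= l1
    elif g < -l1:
        g += l1
    else:
        g = 0.0
    # L2 regularization: enlarge the denominator, which shrinks the score.
    denominator = h + l2
    if denominator <= 0.0:
        return 0.0
    # Shrinkage scales the contribution of the rule to the ensemble.
    return -shrinkage * g / denominator


unregularized = regularized_score([-1.0, -1.0], [0.5, 0.5])
with_l2 = regularized_score([-1.0, -1.0], [0.5, 0.5], l2=1.0)
with_shrinkage = regularized_score([-1.0, -1.0], [0.5, 0.5], l2=1.0, shrinkage=0.3)
```

Both larger regularization weights and a smaller shrinkage value pull the predicted score towards zero, so more rules are typically needed to fit the training data.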

Different Prediction Strategies

  • Various strategies for predicting scores, binary labels or probabilities are available, depending on whether a classification or regression model is used.

  • Isotonic regression models can be used to calibrate marginal and joint probabilities predicted by a classification model.
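The idea behind isotonic calibration can be sketched with the pool adjacent violators algorithm, which fits a non-decreasing sequence to observed outcomes that are sorted by predicted probability. The function `fit_isotonic` below is a generic, illustrative implementation and does not reflect the calibration code of this package.

```python
def fit_isotonic(values, weights=None):
    """Pool adjacent violators: least-squares non-decreasing fit to `values`."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block stores [mean, total weight, number of values]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge neighboring blocks as long as they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            merged_mean = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([merged_mean, total, count1 + count2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# True labels of five examples, sorted by increasing predicted probability.
outcomes = [0, 1, 0, 1, 1]
calibrated = fit_isotonic(outcomes)
```

The fitted values can then serve as calibrated probabilities for the corresponding ranges of uncalibrated predictions.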

Flexible Handling of Input Data

  • Native support for numerical, ordinal, and nominal features eliminates the need for pre-processing techniques such as one-hot encoding.

  • Handling of missing feature values, i.e., occurrences of NaN in the feature matrix, is implemented by the algorithm.
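As an illustration of how a rule's conditions can operate directly on different feature types, the sketch below tests a single condition against a feature value. The comparator strings and the function `satisfies` are hypothetical, and the choice that a missing value never satisfies a condition is an assumption made for this example, not a statement about the algorithm's actual behavior.

```python
import math
import operator

# Numerical and ordinal features use <= and >, nominal features use == and !=.
OPERATORS = {
    "<=": operator.le,
    ">": operator.gt,
    "==": operator.eq,
    "!=": operator.ne,
}


def satisfies(value, comparator, threshold):
    """Return True if a feature value satisfies a single condition."""
    # Assumption for this sketch: a missing value (NaN) satisfies no condition.
    if isinstance(value, float) and math.isnan(value):
        return False
    return OPERATORS[comparator](value, threshold)


numerical_ok = satisfies(3.0, "<=", 3.5)
nominal_ok = satisfies(2, "==", 2)          # nominal value stored as a code
missing_ok = satisfies(float("nan"), "!=", 2)
```

Because nominal values are compared for equality only, no one-hot encoding is needed in this scheme.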

Fine-grained Control over Model Characteristics

  • Rules can be constructed via a greedy search or a beam search. The latter may help to improve the quality of individual rules.

  • Single-output, partial, or complete heads can be used by rules, i.e., they can predict for a single output, a subset of the available outputs, or all of them. Predicting for multiple outputs simultaneously makes it possible to model local dependencies between them.

  • Fine-grained control over the specificity/generality of rules is provided via hyperparameters.
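The difference between single-output, partial, and complete heads can be sketched with a small data structure. The class `Rule` and the function `predict_scores` below are hypothetical and only illustrate how the scores of all rules that cover an example add up per output. They are not the data structures used by this package.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    conditions: list  # tuples of (feature index, "<=" or ">", threshold)
    head: dict        # maps an output index to the predicted score

    def covers(self, x):
        """Return True if the example `x` satisfies all conditions."""
        for feature, comparator, threshold in self.conditions:
            if comparator == "<=" and not x[feature] <= threshold:
                return False
            if comparator == ">" and not x[feature] > threshold:
                return False
        return True


def predict_scores(rules, x, num_outputs):
    """Sum the heads of all rules that cover `x`, separately per output."""
    scores = [0.0] * num_outputs
    for rule in rules:
        if rule.covers(x):
            for output, score in rule.head.items():
                scores[output] += score
    return scores


rules = [
    Rule([], {0: 0.25, 1: -0.5, 2: 0.5}),       # complete head, covers everything
    Rule([(0, "<=", 1.5)], {1: 1.0}),           # single-output head
    Rule([(1, ">", 0.0)], {0: 0.5, 2: -0.25}),  # partial head
]
scores = predict_scores(rules, [1.0, 2.0], num_outputs=3)
```

A rule with more conditions covers fewer examples and is therefore more specific, whereas a rule without conditions applies to every example.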

Support for Post-Optimization and Pruning

  • Incremental reduced error pruning can be used for removing overly specific conditions from rules and preventing overfitting.

  • Post- and pre-pruning (a.k.a. early stopping) make it possible to determine the optimal number of rules to be included in an ensemble.

  • Sequential post-optimization may help to improve the predictive performance of a model by reconstructing each rule in the context of the other rules.
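The two ways of determining the ensemble size can be contrasted with a short sketch. It assumes that the loss on a holdout set has been recorded after each added rule. The function names, the `patience` parameter, and the loss values are made up for illustration and do not mirror the options of this package.

```python
def post_pruned_size(holdout_losses):
    """Post-pruning: train all rules, then keep the prefix with minimal loss."""
    best_index = min(range(len(holdout_losses)), key=holdout_losses.__getitem__)
    return best_index + 1


def early_stopping_size(holdout_losses, patience):
    """Pre-pruning: stop once the loss has not improved for `patience` rules."""
    best_loss = float("inf")
    best_size = 0
    for size, loss in enumerate(holdout_losses, start=1):
        if loss < best_loss:
            best_loss, best_size = loss, size
        elif size - best_size >= patience:
            break  # training would be aborted at this point
    return best_size


# holdout_losses[i] is the holdout loss of an ensemble with i + 1 rules.
holdout_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.7, 0.5]
size_post = post_pruned_size(holdout_losses)
size_pre = early_stopping_size(holdout_losses, patience=3)
```

With these toy values, early stopping settles on a smaller ensemble because it never sees the late improvement, which illustrates the trade-off between training time and the quality of the chosen size.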

⌚ Runtime and Memory Optimizations

In addition to the features mentioned above, several techniques that may speed up training or reduce the memory footprint are currently implemented.

Approximation Techniques

  • Unsupervised feature binning can be used to speed up the evaluation of a rule’s potential conditions when dealing with numerical features.

  • Sampling techniques and stratification methods can be used for learning new rules on a subset of the available training examples, features, or output variables.

  • Gradient-based label binning (GBLB) can be used for assigning the labels included in a multi-label classification dataset to a limited number of bins. This may speed up training significantly when minimizing a non-decomposable loss function using rules with partial or complete heads.
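As an example of unsupervised feature binning, the sketch below assigns numerical feature values to bins of equal width, so that candidate thresholds only need to be evaluated at bin boundaries rather than between every pair of distinct values. Equal-width binning is one common choice that is used here for illustration, and the function `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of `num_bins` bins of equal width."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0] * len(values)  # all values are identical
    width = (highest - lowest) / num_bins
    # The maximum value would fall into bin `num_bins`, so it is clamped.
    return [min(int((value - lowest) / width), num_bins - 1) for value in values]


feature_values = [0.0, 1.0, 2.0, 9.0, 10.0]
bin_indices = equal_width_bins(feature_values, num_bins=5)
```

The fewer bins are used, the fewer potential conditions must be evaluated, at the cost of coarser thresholds.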

Sparse Data Structures

  • Sparse feature matrices can be used for training and prediction. This may speed up training significantly on some datasets.

  • Sparse ground truth matrices can be used for training. This may reduce the memory footprint in case of large datasets.

  • Sparse prediction matrices can be used for storing predicted labels. This may reduce the memory footprint in case of large datasets.

  • Sparse matrices for storing gradients and Hessians can be used if supported by the loss function. This may speed up training significantly on datasets with many output variables.
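Why sparse matrices can save memory is easy to see for ground truth matrices, in which most labels are usually irrelevant. The following sketch stores only the column indices of the relevant labels per example. The helper names are hypothetical, and the layout is a simple list-of-lists format chosen for illustration, not the data structure used by this package.

```python
def to_sparse_rows(dense):
    """Keep only the column indices of non-zero entries in each row."""
    return [[column for column, value in enumerate(row) if value]
            for row in dense]


def num_stored(sparse_rows):
    """Number of entries that are stored explicitly."""
    return sum(len(row) for row in sparse_rows)


dense_labels = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
sparse_labels = to_sparse_rows(dense_labels)
stored = num_stored(sparse_labels)
dense_size = sum(len(row) for row in dense_labels)
```

Here only 3 of 18 entries are stored, and the saving grows with the number of labels as long as each example is associated with only a few of them.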

Parallelization

  • Multi-threading can be used for parallelizing the evaluation of a rule’s potential refinements across several features, updating the gradients and Hessians of individual examples in parallel, or obtaining predictions for several examples in parallel.
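Parallelizing the evaluation of refinements across features can be sketched as follows. Each feature is searched independently for its best threshold, and the best result overall is selected afterwards. The quality measure (the absolute sum of the gradients of the covered examples) is a deliberately simple stand-in, and all function names are hypothetical. Note that pure Python threads do not yield the speedup of a native multi-threaded implementation, so the sketch only conveys the structure of the computation.

```python
from concurrent.futures import ThreadPoolExecutor


def best_threshold(column, gradients):
    """Return (quality, threshold) of the best `<= threshold` condition."""
    best = (float("-inf"), None)
    # The largest value is skipped, as it would cover all examples.
    for threshold in sorted(set(column))[:-1]:
        covered = sum(g for v, g in zip(column, gradients) if v <= threshold)
        quality = abs(covered)  # stand-in quality measure for this sketch
        if quality > best[0]:
            best = (quality, threshold)
    return best


def find_best_condition(columns, gradients, num_threads=2):
    """Evaluate all features in parallel and return the best condition."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda col: best_threshold(col, gradients), columns))
    quality, feature = max((q, i) for i, (q, _) in enumerate(results))
    return feature, results[feature][1], quality


columns = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]  # one list per feature
gradients = [-1.0, -2.0, 1.0, 1.0]
best_condition = find_best_condition(columns, gradients)
```

Since the features are evaluated independently, no synchronization is needed until the per-feature results are compared.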

📚 Documentation

This documentation discusses the following topics:

Moreover, we provide Python and C++ API references for developers.