Quickstart
Building the Project
The algorithm provided by this project is mostly implemented in C++. In addition, a Python wrapper that implements the scikit-learn API is provided. To integrate the underlying C++ implementation with Python, Cython is used.
Unlike pure Python programs, the code written in C++ and Cython must be compiled to be able to run the algorithm. To facilitate the compilation, the project comes with a Makefile that automatically executes the necessary steps.
As a prerequisite, Python 3.7 (or a more recent version) must be available on the host system. All remaining compile- or build-time dependencies will be installed automatically when following the instructions below.
Note
We only support x86_64 Linux platforms out of the box, although compilation should be possible on Windows and macOS systems as well. Unfortunately, we currently do not have the resources to provide support for these platforms. For future releases, we plan to distribute prebuilt packages for all major platforms.
Step 1: Create a virtual environment
First, create a virtual Python environment via the following command:
make venv
All compile-time dependencies (numpy, scipy, Cython, meson and ninja) that are required for building the project should automatically be installed into the virtual environment when executing the above command. As a result, a subdirectory “venv” should have been created in the project’s root directory.
Step 2: Compilation
Afterwards, the compilation can be started by executing the following command:
make compile
Compilation is based on the build system Meson and uses Ninja as a backend.
Whenever any C++ or Cython source files have been modified, they must be recompiled by running the above command again. If compilation files already exist, only the affected parts of the code are recompiled.
Step 3: Installation
Once the compilation has completed, the library can be installed into the virtual environment. For this purpose, the project’s Makefile provides the following command:
make install
The above command also installs all runtime dependencies, such as scikit-learn. A full list of all dependencies can be found in the file “python/setup.py”.
Step 4: Generating the Documentation (Optional)
To generate the documentation (this document), Doxygen must be installed on the system beforehand. It is used to automatically generate API documentation from the source code. Running the following command generates the documentation’s HTML files:
make doc
Afterwards, the generated files can be found in the directory doc/build_/html.
Cleanup
To delete all compilation files, the generated documentation files, and the virtual environment, use the following command:
make clean
For more fine-grained control, the command make clean_venv deletes only the virtual environment, and the command make clean_compile deletes only the compilation files. To remove only the compiled Cython files, use make clean_cython. Accordingly, make clean_cpp removes only the compiled C++ files. To delete the generated documentation files, use make clean_doc.
Running the Algorithm
The Python script python/main_boomer.py runs experiments on a specific data set using different configurations of the learning algorithm. Besides training and evaluating models, the script can also write experimental results into an output directory. Furthermore, the learned models can be stored on disk for later use.
The following example shows how the script can be executed:
venv/bin/python3 python/main_boomer.py --data-dir /path/to/data --output-dir /path/to/results/emotions --model-dir /path/to/models/emotions --dataset emotions --folds 10 --max-rules 1000 --instance-sampling with-replacement --feature-sampling without-replacement --loss logistic-label-wise --shrinkage 0.3 --pruning None --head-type single-label
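When experiments are launched from another Python program, the same invocation can be assembled as an argument list, which avoids shell quoting issues. The following is a minimal sketch, not part of the project: the helpers build_command and run_experiment are hypothetical names, and only parameters that appear in the command above are used.

```python
import subprocess


def build_command(python="venv/bin/python3", script="python/main_boomer.py", **params):
    """Hypothetical helper: turns keyword arguments into `--name value` pairs."""
    command = [python, script]
    for name, value in params.items():
        # Python identifiers use underscores, the script's parameters use hyphens.
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


def run_experiment(command):
    """Hypothetical helper: runs the script and raises CalledProcessError on failure."""
    return subprocess.run(command, check=True)


# A subset of the parameters from the example above.
command = build_command(
    data_dir="/path/to/data",
    dataset="emotions",
    folds=10,
    max_rules=1000,
    loss="logistic-label-wise",
    shrinkage=0.3,
)
```

Calling run_experiment(command) from the project's root directory would then start the experiment, provided that the virtual environment has been set up as described above.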
Parameters
The behavior of the BOOMER algorithm can be controlled in a fine-grained manner via a large number of parameters. Most of these parameters are optional. Unless specified otherwise, default settings that work well in most cases are used. The following provides an overview of all available parameters and their default values.
Note
Each parameter is identified by a unique name and must be specified according to the following syntax:
--parameter-name value
In addition to the specified value, some parameters accept additional options as key-value pairs. These options may be provided using the following bracket notation:
--parameter-name value{key1=value1,key2=value2}
Parameter values that include additional options must not contain any spaces. Depending on the shell that is used to run the program, special characters like { or } may need to be escaped. When using bash or sh, this can be achieved by adding single quotes as follows:
--parameter-name value'{key1=value1,key2=value2}'
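The bracket notation can also be generated programmatically. The sketch below is an illustration only: format_parameter is a hypothetical helper, and it relies on Python's shlex.quote, which wraps the whole value in single quotes whenever it contains characters such as { or }. For bash and sh, this is equivalent to quoting only the bracketed part.

```python
import shlex


def format_parameter(name, value, **options):
    """Hypothetical helper: formats `--name value{key1=value1,...}` for a POSIX shell."""
    argument = str(value)
    if options:
        # Key-value pairs are joined without spaces, as required by the notation.
        pairs = ",".join(f"{key}={val}" for key, val in options.items())
        argument += "{" + pairs + "}"
    # shlex.quote leaves safe strings untouched and single-quotes everything else.
    return f"--{name} {shlex.quote(argument)}"


with_options = format_parameter("instance-sampling", "without-replacement", sample_size=0.5)
without_options = format_parameter("dataset", "emotions")
```

Here, with_options holds --instance-sampling 'without-replacement{sample_size=0.5}', whereas without_options needs no quoting at all.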
Data set
The following parameters are always needed to specify the data set that should be used for training:
--data-dir
  The path of the directory where the data set files are located (an ARFF file and a corresponding XML file according to the Mulan format).
--dataset
  The name of the data set files (without suffix).
Training/Testing Procedure
--folds (Default value = 1)
  The total number of folds to be used for cross validation. Must be greater than 1, or 1 if no cross validation should be used.
--current-fold (Default value = 0)
  The cross validation fold to be performed. Must be in [1, --folds], or 0 if all folds should be performed. This parameter is ignored if --folds is set to 1.
--evaluate-training-data (Default value = false)
  true: The models are not only evaluated on the test data, but also on the training data.
  false: The models are only evaluated on the test data.
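The interplay of --folds and --current-fold can be summarized in a few lines of Python. This is a sketch of the rules stated above, not the script's actual implementation, and folds_to_run is a hypothetical name.

```python
def folds_to_run(folds=1, current_fold=0):
    """Hypothetical helper: returns the cross validation folds that are performed."""
    if folds < 1:
        raise ValueError("--folds must be at least 1")
    if folds == 1:
        # No cross validation is used and --current-fold is ignored.
        return []
    if current_fold == 0:
        # All folds are performed.
        return list(range(1, folds + 1))
    if not 1 <= current_fold <= folds:
        raise ValueError("--current-fold must be in [1, --folds] or 0")
    return [current_fold]


all_folds = folds_to_run(folds=10)
single_fold = folds_to_run(folds=10, current_fold=3)
```

Running a single fold per invocation may be convenient for distributing the folds of one experiment across several machines.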
Data Format
The following parameters specify how the training data should be organized:
--one-hot-encoding (Default value = false)
  true: One-hot-encoding is used to encode nominal attributes.
  false: The algorithm’s ability to natively handle nominal attributes is used.
--feature-format (Default value = auto)
  auto: The most suitable format for representation of the feature matrix is chosen automatically by estimating which representation requires less memory.
  dense: Enforces that the feature matrix is stored using a dense format.
  sparse: Enforces that the feature matrix is stored using a sparse format. Using a sparse format may reduce the memory footprint and/or speed up the training process on some data sets.
--label-format (Default value = auto)
  auto: The most suitable format for representation of the label matrix is chosen automatically by estimating which representation requires less memory.
  dense: Enforces that the label matrix is stored using a dense format.
  sparse: Enforces that the label matrix is stored using a sparse format. Using a sparse format may reduce the memory footprint on some data sets.
--prediction-format (Default value = auto)
  auto: The most suitable format for representation of predicted labels is chosen automatically based on the sparsity of the ground truth labels supplied for training.
  dense: Enforces that predicted labels are stored using a dense format.
  sparse: Enforces that predicted labels are stored using a sparse format. Using a sparse format may reduce the memory footprint on some data sets.
Input Files
The following parameters specify the directories where input files can be found:
--model-dir (Default value = None)
  The path of the directory where models should be stored. If models are found in the specified directory, they will be used instead of training from scratch. If no models are available, the trained models will be saved in the specified directory once training has completed.
--parameter-dir (Default value = None)
  The path of the directory where configuration files, which specify the parameters to be used by the algorithm, are located. If such files are found in the specified directory, the parameter settings they contain are used instead of the parameters that are provided via the command line.
Output
The following parameters customize the console output and the output files that are written by the algorithm:
--output-dir (Default value = None)
  The path of the directory where experimental results should be saved.
--store-predictions (Default value = false)
  true: The predictions for individual examples and labels are written into output files. Only has an effect if the parameter --output-dir is specified.
  false: Predictions are not written into output files.
--print-rules (Default value = false)
  true: The induced rules are printed on the console.
  false: The induced rules are not printed on the console.
--store-rules (Default value = false)
  true: The induced rules are written into a text file. Only has an effect if the parameter --output-dir is specified.
  false: The induced rules are not written into a text file.
--print-options (Default value = None)
  Additional options to be taken into account when writing rules on the console or into an output file. Only has an effect if the parameter --print-rules or --store-rules is set to true. The options must be given using the bracket notation. The following options are available:
    print_feature_names (Default value = true): true if the names of features should be printed instead of their indices, false otherwise.
    print_label_names (Default value = true): true if the names of labels should be printed instead of their indices, false otherwise.
    print_nominal_values (Default value = true): true if the names of nominal values should be printed instead of their numerical representation, false otherwise.
--log-level (Default value = info)
  The log level to be used. Must be debug, info, warn, warning, error, critical, fatal or notset.
Algorithmic Parameters
The following parameters adjust the behavior of the algorithm:
--random-state (Default value = 1)
  The seed to be used by random number generators. Must be at least 1.
--max-rules (Default value = 1000)
  The maximum number of rules to be induced. Must be at least 1, or 0 if the number of rules should not be restricted.
--default-rule (Default value = true)
  true: The first rule is a default rule.
  false: No default rule is used.
--time-limit (Default value = 0)
  The duration in seconds after which the induction of rules should be canceled. Must be at least 1, or 0 if no time limit should be set.
--label-sampling (Default value = None)
  None: All labels are considered for learning a new rule.
  without-replacement: The labels to be considered when learning a new rule are chosen randomly. The following options may be provided using the bracket notation:
    num_samples (Default value = 1): The number of labels to be included in a sample. Must be at least 1.
--feature-sampling (Default value = without-replacement)
  None: All features are considered for learning a new rule.
  without-replacement: A random subset of the features is used to search for the refinements of rules. The following options may be provided using the bracket notation:
    sample_size (Default value = 0): The percentage of features to be included in a sample, e.g., a value of 0.6 corresponds to 60% of the features. Must be in (0, 1], or 0 if the sample size should be calculated as log2(numFeatures - 1) + 1.
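To get a feeling for how many features the default setting of --feature-sampling selects, the formula log2(numFeatures - 1) + 1 can be evaluated directly. The following sketch uses hypothetical function names and returns the raw value, since how the algorithm rounds the result to a whole number of features is not stated here.

```python
import math


def default_feature_sample_size(num_features):
    """Evaluates log2(numFeatures - 1) + 1, the formula used if sample_size is 0."""
    return math.log2(num_features - 1) + 1


def num_sampled_features(num_features, sample_size=0.0):
    """Hypothetical helper: number of features per sample, before any rounding."""
    if sample_size == 0:
        return default_feature_sample_size(num_features)
    if not 0 < sample_size <= 1:
        raise ValueError("sample_size must be in (0, 1] or 0")
    return sample_size * num_features


small = default_feature_sample_size(9)        # log2(8) + 1
large = default_feature_sample_size(1025)     # log2(1024) + 1
half = num_sampled_features(100, sample_size=0.5)
```

The logarithmic default grows slowly: going from 9 to 1025 features raises the value only from 4 to 11, whereas an explicit sample_size scales linearly with the number of features.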
--instance-sampling (Default value = None)
  None: All training examples are considered for learning a new rule.
  with-replacement: The training examples to be considered for learning a new rule are selected randomly with replacement. The following options may be provided using the bracket notation:
    sample_size (Default value = 1.0): The percentage of examples to be included in a sample, e.g., a value of 0.6 corresponds to 60% of the available examples. Must be in (0, 1].
  without-replacement: The training examples to be considered for learning a new rule are selected randomly without replacement. The following options may be provided using the bracket notation:
    sample_size (Default value = 0.66): The percentage of examples to be included in a sample, e.g., a value of 0.6 corresponds to 60% of the available examples. Must be in (0, 1).
  stratified-label-wise: The training examples to be considered for learning a new rule are selected according to an iterative stratified sampling method that ensures that for each label the proportion of relevant and irrelevant examples is maintained. The following options may be provided using the bracket notation:
    sample_size (Default value = 0.66): The percentage of examples to be included in a sample, e.g., a value of 0.6 corresponds to 60% of the available examples. Must be in (0, 1).
  stratified-example-wise: The training examples to be considered for learning a new rule are selected according to a stratified sampling method, where distinct label vectors are treated as individual classes. The following options may be provided using the bracket notation:
    sample_size (Default value = 0.66): The percentage of examples to be included in a sample, e.g., a value of 0.6 corresponds to 60% of the available examples. Must be in (0, 1).
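The difference between the two random strategies of --instance-sampling can be illustrated with Python's random module. This is a simplified sketch rather than the algorithm's C++ implementation: sample_examples is a hypothetical name, the rounding of the sample size is an assumption, and the stratified variants are not covered.

```python
import random


def sample_examples(num_examples, method, sample_size, seed=1):
    """Hypothetical helper: draws the indices of the examples used for one rule."""
    rng = random.Random(seed)
    # Assumption: the number of sampled examples is rounded to the nearest integer.
    num_samples = max(1, round(sample_size * num_examples))
    indices = range(num_examples)
    if method == "with-replacement":
        # The same example may be drawn several times (bootstrap sample).
        return rng.choices(indices, k=num_samples)
    if method == "without-replacement":
        # Every example is drawn at most once.
        return rng.sample(indices, k=num_samples)
    raise ValueError(f"unsupported method: {method}")


bootstrap = sample_examples(100, "with-replacement", sample_size=1.0)
subset = sample_examples(100, "without-replacement", sample_size=0.66)
```

With replacement, a sample that is as large as the training set typically still omits some examples because others occur repeatedly. Without replacement, all sampled indices are distinct.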
--recalculate-predictions (Default value = true)
  true: The predictions of rules are recalculated on the entire training data if the parameter --instance-sampling is not set to None.
  false: The predictions of rules are not recalculated.
--holdout (Default value = None)
  None: No holdout set is created.
  random: The available examples are randomly split into a training set and a holdout set. The following options may be provided using the bracket notation:
    holdout_set_size (Default value = 0.33): The percentage of examples to be included in the holdout set, e.g., a value of 0.3 corresponds to 30% of the available examples. Must be in (0, 1).
  stratified-label-wise: The available examples are split into a training set and a holdout set according to an iterative stratified sampling method that ensures that for each label the proportion of relevant and irrelevant examples is maintained. The following options may be provided using the bracket notation:
    holdout_set_size (Default value = 0.33): The percentage of examples to be included in the holdout set, e.g., a value of 0.3 corresponds to 30% of the available examples. Must be in (0, 1).
  stratified-example-wise: The available examples are split into a training set and a holdout set according to a stratified sampling method, where distinct label vectors are treated as individual classes. The following options may be provided using the bracket notation:
    holdout_set_size (Default value = 0.33): The percentage of examples to be included in the holdout set, e.g., a value of 0.3 corresponds to 30% of the available examples. Must be in (0, 1).
--early-stopping (Default value = None)
  None: No strategy for early stopping is used.
  loss: Stops the induction of new rules as soon as the performance of the model does not improve on a holdout set, according to the loss function. Only has an effect if the parameter --holdout is not set to None. The following options may be provided using the bracket notation:
    min_rules (Default value = 100): The minimum number of rules. Must be at least 1.
    update_interval (Default value = 1): The interval to be used to update the quality of the current model, e.g., a value of 5 means that the model quality is assessed every 5 rules. Must be at least 1.
    stop_interval (Default value = 1): The interval to be used to decide whether the induction of rules should be stopped, e.g., a value of 10 means that the rule induction might be stopped after 10, 20, … rules. Must be a multiple of update_interval.
    num_past (Default value = 50): The number of quality scores of past iterations to be stored in a buffer. Must be at least 1.
    num_recent (Default value = 50): The number of quality scores of the most recent iterations to be stored in a buffer. Must be at least 1.
    aggregation (Default value = min): The name of the aggregation function that should be used to aggregate the scores in both buffers. Must be min, max or avg.
    min_improvement (Default value = 0.005): The minimum improvement in percent that must be reached when comparing the aggregated scores in both buffers for the rule induction to be continued. Must be in [0, 1].
    force_stop (Default value = true): true if the induction of rules should be stopped as soon as the stopping criterion is met, false if only the time of stopping should be stored.
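The role of the options num_past, num_recent, aggregation and min_improvement may be easier to follow with a small model of the two buffers. The sketch below is an interpretation of the description above, not the project's implementation. It assumes that scores are loss values (lower is better), that a score moves from the buffer of recent scores to the buffer of past scores once it is displaced, and that the improvement is measured relative to the aggregated past score. The class name is hypothetical, and min_rules, update_interval and stop_interval are left out.

```python
from collections import deque
from statistics import mean

AGGREGATIONS = {"min": min, "max": max, "avg": mean}


class EarlyStoppingSketch:
    """Hypothetical two-buffer stopping criterion (an assumption, see lead-in)."""

    def __init__(self, num_past=50, num_recent=50, aggregation="min", min_improvement=0.005):
        self.past = deque(maxlen=num_past)
        self.recent = deque(maxlen=num_recent)
        self.aggregate = AGGREGATIONS[aggregation]
        self.min_improvement = min_improvement

    def add_score(self, loss):
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent score is about to be displaced: keep it as a past score.
            self.past.append(self.recent[0])
        self.recent.append(loss)

    def should_stop(self):
        if len(self.past) < self.past.maxlen:
            return False  # not enough history yet
        past = self.aggregate(self.past)
        recent = self.aggregate(self.recent)
        if past == 0:
            return True  # a loss of zero cannot be improved any further
        return (past - recent) / past < self.min_improvement


criterion = EarlyStoppingSketch(num_past=2, num_recent=2, min_improvement=0.1)
decisions = []
for loss in [1.0, 0.9, 0.8, 0.7, 0.7, 0.7]:
    criterion.add_score(loss)
    decisions.append(criterion.should_stop())
```

In this run, the criterion first becomes true after the sixth score, when the best recent loss (0.7) no longer improves on the best past loss (0.7).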
--feature-binning (Default value = None)
  None: No feature binning is used.
  equal-width: Examples are assigned to bins, based on their feature values, according to the equal-width binning method. The following options may be provided using the bracket notation:
    bin_ratio (Default value = 0.33): A percentage that specifies how many bins should be used, e.g., a value of 0.3 means that the number of bins should be set to 30% of the number of distinct values for a feature.
    min_bins (Default value = 2): The minimum number of bins to be used. Must be at least 2.
    max_bins (Default value = 0): The maximum number of bins to be used. Must be at least min_bins, or 0 if the number of bins should not be restricted.
  equal-frequency: Examples are assigned to bins, based on their feature values, according to the equal-frequency binning method. The following options may be provided using the bracket notation:
    bin_ratio (Default value = 0.33): A percentage that specifies how many bins should be used, e.g., a value of 0.3 means that the number of bins should be set to 30% of the number of distinct values for a feature.
    min_bins (Default value = 2): The minimum number of bins to be used. Must be at least 2.
    max_bins (Default value = 0): The maximum number of bins to be used. Must be at least min_bins, or 0 if the number of bins should not be restricted.
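How bin_ratio, min_bins and max_bins interact for --feature-binning can be expressed as a short calculation. The following sketch follows the option descriptions above. The function name is hypothetical, and rounding to the nearest integer is an assumption, as the exact rounding used by the algorithm is not specified here.

```python
def num_feature_bins(num_distinct_values, bin_ratio=0.33, min_bins=2, max_bins=0):
    """Hypothetical helper: number of bins used for a single feature."""
    if min_bins < 2:
        raise ValueError("min_bins must be at least 2")
    if max_bins != 0 and max_bins < min_bins:
        raise ValueError("max_bins must be at least min_bins or 0")
    # Assumption: the ratio-based bin count is rounded to the nearest integer.
    bins = max(min_bins, round(bin_ratio * num_distinct_values))
    if max_bins > 0:
        bins = min(bins, max_bins)
    return bins


default_bins = num_feature_bins(100)             # 33% of 100 distinct values
capped_bins = num_feature_bins(100, max_bins=10)  # restricted by max_bins
few_values = num_feature_bins(3)                  # raised to min_bins
```

For a feature with 100 distinct values, the defaults result in 33 bins. Setting max_bins to 10 caps this at 10, and a feature with only 3 distinct values still receives the minimum of 2 bins.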
--label-binning (Default value = auto)
  None: No label binning is used.
  auto: The most suitable strategy for label binning is chosen automatically based on the loss function and the type of rule heads.
  equal-width: The labels for which a rule may predict are assigned to bins according to the equal-width binning method. The following options may be provided using the bracket notation:
    bin_ratio (Default value = 0.04): A percentage that specifies how many bins should be used, e.g., a value of 0.04 means that the number of bins should be set to 4% of the number of labels.
    min_bins (Default value = 1): The minimum number of bins to be used. Must be at least 1.
    max_bins (Default value = 0): The maximum number of bins to be used, or 0 if the number of bins should not be restricted.
--pruning (Default value = None)
  None: No pruning is used.
  irep: Subsequent conditions of rules may be pruned on a holdout set, similar to the IREP algorithm. Only has an effect if the parameter --instance-sampling is not set to None.
--min-coverage (Default value = 1)
  The minimum number of training examples that must be covered by a rule. Must be at least 1.
--max-conditions (Default value = 0)
  The maximum number of conditions to be included in a rule’s body. Must be at least 1, or 0 if the number of conditions should not be restricted.
--max-head-refinements (Default value = 1)
  The maximum number of times the head of a rule may be refined. Must be at least 1, or 0 if the number of refinements should not be restricted.
--head-type (Default value = auto)
  auto: The most suitable type of rule heads is chosen automatically based on the loss function.
  single-label: All rules predict for a single label.
  complete: All rules predict for all labels simultaneously, potentially capturing dependencies between the labels.
--shrinkage (Default value = 0.3)
  The shrinkage parameter, a.k.a. the learning rate, to be used. Must be in (0, 1].
--loss (Default value = logistic-label-wise)
  logistic-label-wise: A variant of the logistic loss function that is applied to each label individually.
  logistic-example-wise: A variant of the logistic loss function that takes all labels into account at the same time.
  squared-error-label-wise: A variant of the squared error loss that is applied to each label individually.
  hinge-label-wise: A variant of the hinge loss that is applied to each label individually.
--predictor (Default value = auto)
  auto: The most suitable strategy for making predictions is chosen automatically, depending on the loss function.
  label-wise: The prediction for an example is determined for each label independently.
  example-wise: The label vector that is predicted for an example is chosen from the set of label vectors encountered in the training data.
--l2-regularization-weight (Default value = 1.0)
  The weight of the L2 regularization. Must be at least 0. If 0 is used, the L2 regularization is turned off entirely. Increasing the value causes the model to become more conservative.
Multithreading
The following parameters enable multi-threading for different aspects of the algorithm:
--parallel-rule-refinement (Default value = auto)
  auto: The number of threads to be used to search for potential refinements of rules in parallel is chosen automatically, depending on the loss function.
  false: No multi-threading is used to search for potential refinements of rules.
  true: Multi-threading is used to search for potential refinements of rules in parallel. The following options may be provided using the bracket notation:
    num_threads (Default value = 0): The number of threads to be used. Must be at least 1, or 0 if the number of cores available on the machine should be used.
--parallel-statistic-update (Default value = auto)
  auto: The number of threads to be used to calculate the gradients and Hessians for different examples in parallel is chosen automatically, depending on the loss function.
  false: No multi-threading is used to calculate the gradients and Hessians of different examples.
  true: Multi-threading is used to calculate the gradients and Hessians of different examples in parallel. The following options may be provided using the bracket notation:
    num_threads (Default value = 0): The number of threads to be used. Must be at least 1, or 0 if the number of cores available on the machine should be used.
--parallel-prediction (Default value = true)
  false: No multi-threading is used to obtain predictions for different examples.
  true: Multi-threading is used to obtain predictions for different examples in parallel. The following options may be provided using the bracket notation:
    num_threads (Default value = 0): The number of threads to be used. Must be at least 1, or 0 if the number of cores available on the machine should be used.
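The meaning of the option num_threads, which is shared by all three parameters, can be sketched as follows. The function name is hypothetical, and using os.cpu_count() to determine the number of available cores is an assumption made for illustration. The algorithm itself may query the hardware differently.

```python
import os


def resolve_num_threads(num_threads=0):
    """Hypothetical helper: maps the option num_threads to an actual thread count."""
    if num_threads < 0:
        raise ValueError("num_threads must be at least 1 or 0")
    if num_threads == 0:
        # Assumption: fall back to a single thread if the core count is unknown.
        return os.cpu_count() or 1
    return num_threads


explicit = resolve_num_threads(4)
automatic = resolve_num_threads()
```

An explicit value is used as given, whereas the default of 0 resolves to the number of cores reported for the machine.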