Supported Dataset Formats¶
The package mlrl-testbed is built in a modular fashion. This means that its functionality can be extended via extensions, including support for different dataset formats. In the following, the dataset formats supported by these extensions are discussed in detail.
We provide a curated list of supported datasets in this repository. The datasets included in this repository originate from the following sources:
The MEKA project
The MULAN project
The LIBSVM project
The MLDA tool for analyzing multi-label datasets
LIBSVM Format¶
The LIBSVM dataset format was popularized by the LIBSVM library and is supported by many machine learning frameworks. Support for this particular dataset format is brought to mlrl-testbed via the package mlrl-testbed-sklearn and is implemented via scikit-learn’s sklearn.datasets.load_svmlight_file(). It is a simple plain text format aimed at efficiently storing sparse datasets without unnecessary overhead. Due to its simplicity, it comes with some limitations. For example, unlike the ARFF format, it does not include any meta-data, such as attribute types or names. For this reason, all features are always interpreted as numerical ones.
Each line in an SVM file corresponds to a particular training or test example. For example, in the case of a (multi-label) classification dataset, a line can look like this:
<label-index1>,<label-index2> <feature-index1>:<feature-value1> <feature-index2>:<feature-value2>
The line starts with a comma-separated list of indices. These are the indices of the labels that are relevant to the example. Then, a space-separated list of tuples follows. Each tuple corresponds to a certain feature, identified via an index, and a corresponding value. As SVM files are supposed to encode sparse datasets, irrelevant labels and feature values that are equal to zero are not explicitly stored in the file.
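To make the line layout concrete, the following minimal sketch splits one such line into its relevant label indices and its sparse feature values using only the Python standard library. The helper name parse_classification_line and the sample line are hypothetical and serve illustration only; this is not how mlrl-testbed reads such files.

```python
def parse_classification_line(line):
    """Split one line into relevant label indices and sparse feature values."""
    tokens = line.split()
    labels = []
    # A leading token without a colon is the comma-separated list of label
    # indices. If it is missing, no label is relevant to the example.
    if tokens and ":" not in tokens[0]:
        labels = [int(index) for index in tokens[0].split(",")]
        tokens = tokens[1:]
    # Each remaining token is an "<index>:<value>" tuple. Features that do
    # not appear in the line are implicitly zero.
    features = {}
    for token in tokens:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return labels, features


# Hypothetical example line: labels 0 and 3 are relevant, features 1 and 4
# have non-zero values.
labels, features = parse_classification_line("0,3 1:0.5 4:2.0")
```

Storing the features in a dictionary keeps the representation sparse, which mirrors the intent of the file format.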
In case of a regression dataset, a single line could look like this:
<score1>,<score2> <feature-index1>:<feature-value1> <feature-index2>:<feature-value2>
Whereas the features are encoded in the same way as before, the comma-separated list at the start denotes the regression scores corresponding to the example. One score must be given for each available output, i.e., scores that are equal to zero cannot be omitted here.
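The difference between dense scores and sparse features can be sketched as follows. The helper parse_regression_line, the sample line and the assumption of zero-based feature indices are hypothetical choices made for illustration; real files may use one-based indices.

```python
def parse_regression_line(line, num_features):
    """Split one line into its regression scores and a dense feature vector."""
    tokens = line.split()
    # One score is given per output, so the first token is parsed as a dense
    # list, including scores that are equal to zero.
    scores = [float(score) for score in tokens[0].split(",")]
    # Features are sparse: start from zeros and fill in the stored values.
    # Assumption: feature indices are zero-based.
    dense = [0.0] * num_features
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index)] = float(value)
    return scores, dense


# Hypothetical example with two outputs and five features.
scores, dense = parse_regression_line("0.0,1.5 0:2.0 3:-1.0", num_features=5)
```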
Attribute-Relation File Format (ARFF)¶
The Attribute-Relation File Format (ARFF) was proposed by researchers from the University of Waikato, New Zealand. It is used by the WEKA machine learning software, which was developed by the same group. Support for this file format is brought to mlrl-testbed by the package mlrl-testbed-arff.
Note
Currently, the package mlrl-testbed-arff is a hard dependency of mlrl-testbed and is therefore installed alongside it automatically. In the future, this behavior might change and the dependency might become optional.
Mulan Format¶
By default, mlrl-testbed checks if the dataset files are present in the variant used by the Mulan project. It requires two files to be present in a given directory:
An ARFF file that specifies the feature values and ground truth of the training examples.
An XML file that specifies the names of the outputs.
For example, the ARFF file could look like this:
@relation MultiLabelExample
@attribute feature_1 numeric
@attribute feature_2 numeric
@attribute feature_3 numeric
@attribute label_1 {0, 1}
@attribute label_2 {0, 1}
@attribute label_3 {0, 1}
@attribute label_4 {0, 1}
@attribute label_5 {0, 1}
The XML file corresponding to the ARFF file above would look like this:
<labels xmlns="http://mulan.sourceforge.net/labels">
<label name="label_1"/>
<label name="label_2"/>
<label name="label_3"/>
<label name="label_4"/>
<label name="label_5"/>
</labels>
In contrast to the MEKA format discussed below, the Mulan format allows any attribute in an ARFF file to be treated as an output and does not require the outputs to be located at the start or end.
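How the two files relate can be sketched with the standard library alone. The snippet below reads the output names from the XML file shown above and uses them to separate the ARFF attributes into features and outputs. The helper names are hypothetical, and the simplistic header parsing assumes unquoted attribute names without spaces; it is not the parser used by mlrl-testbed-arff.

```python
import xml.etree.ElementTree as ET

ARFF_HEADER = """@relation MultiLabelExample
@attribute feature_1 numeric
@attribute feature_2 numeric
@attribute feature_3 numeric
@attribute label_1 {0, 1}
@attribute label_2 {0, 1}
@attribute label_3 {0, 1}
@attribute label_4 {0, 1}
@attribute label_5 {0, 1}
"""

LABELS_XML = """<labels xmlns="http://mulan.sourceforge.net/labels">
<label name="label_1"/>
<label name="label_2"/>
<label name="label_3"/>
<label name="label_4"/>
<label name="label_5"/>
</labels>"""


def read_output_names(xml_text):
    """Return the names of all <label> elements in document order."""
    root = ET.fromstring(xml_text)
    # Tags are namespace-qualified ("{namespace}label"), so only the local
    # part of each tag is compared.
    return [element.get("name") for element in root.iter()
            if element.tag.rsplit("}", 1)[-1] == "label"]


def read_attribute_names(arff_text):
    """Return the attribute names declared in an ARFF header."""
    names = []
    for line in arff_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].lower() == "@attribute":
            names.append(parts[1])
    return names


output_names = read_output_names(LABELS_XML)
attribute_names = read_attribute_names(ARFF_HEADER)
# Any attribute listed in the XML file is an output, regardless of where it
# is located in the ARFF file.
outputs = set(output_names)
feature_names = [name for name in attribute_names if name not in outputs]
```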
MEKA Format¶
If an XML file is not provided, the program tries to parse the number of outputs from the @relation declaration contained in the ARFF file, as intended by the MEKA project’s dataset format. According to this format, the number of outputs must be specified by including the substring “-C L” in the @relation name, where “L” is the number of leading attributes in the dataset that should be treated as outputs.
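Extracting the number of outputs from such a declaration can be sketched with a regular expression. The helper parse_num_outputs, the sample @relation line and the attribute names are hypothetical, and the sketch only covers the case described above, where “L” is a non-negative number of leading attributes.

```python
import re


def parse_num_outputs(relation_line):
    """Return the number L given via "-C L", or None if it is absent."""
    match = re.search(r"-C\s+(\d+)", relation_line)
    return int(match.group(1)) if match else None


# Hypothetical declaration stating that the first two attributes are outputs.
num_outputs = parse_num_outputs("@relation 'MultiLabelExample: -C 2'")

# Hypothetical attribute names, in the order they are declared in the file.
attribute_names = ["label_1", "label_2", "feature_1", "feature_2", "feature_3"]
output_names = attribute_names[:num_outputs]
feature_names = attribute_names[num_outputs:]
```

Returning None when the substring is missing lets the caller decide how to handle a file that follows neither the Mulan nor the MEKA convention.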