mlrl.testbed.data_splitting module¶

Author: Michael Rapp (michael.rapp.ml@gmail.com)

Provides classes for training and evaluating multi-label classifiers using either cross validation or separate training and test sets.

class mlrl.testbed.data_splitting.CrossValidationFold(num_folds: int, fold: int, current_fold: int)¶

Bases: DataSplit

Provides information about a split of the available data that is used by a single fold of a cross validation.

get_fold() → int | None¶

Returns the cross validation fold, this split corresponds to.

Returns:: The cross validation fold, starting at 0, or None, if no cross validation is used

get_num_folds() → int¶

Returns the total number of cross validation folds.

Returns:: The total number of cross validation folds or 1, if no cross validation is used

is_last_fold() → bool¶

Returns whether this split corresponds to the last fold of a cross validation or not.

Returns:: True, if this split corresponds to the last fold, False otherwise

is_train_test_separated() → bool¶

Returns whether the training data is separated from the test data or not.

Returns:: True, if the training data is separated from the test data, False otherwise

class mlrl.testbed.data_splitting.CrossValidationOverall(num_folds: int)¶

Bases: DataSplit

Provides information about the overall splits of a cross validation.

get_fold() → int | None¶

Returns the cross validation fold, this split corresponds to.

Returns:: The cross validation fold, starting at 0, or None, if no cross validation is used

get_num_folds() → int¶

Returns the total number of cross validation folds.

Returns:: The total number of cross validation folds or 1, if no cross validation is used

is_last_fold() → bool¶

Returns whether this split corresponds to the last fold of a cross validation or not.

Returns:: True, if this split corresponds to the last fold, False otherwise

is_train_test_separated() → bool¶

Returns whether the training data is separated from the test data or not.

Returns:: True, if the training data is separated from the test data, False otherwise

class mlrl.testbed.data_splitting.CrossValidationSplitter(data_set: DataSet, num_folds: int, current_fold: int, random_state: int)¶

Bases: DataSplitter

Splits the available data into training and test sets corresponding to the individual folds of a cross validation.

class mlrl.testbed.data_splitting.DataSet(data_dir: str, data_set_name: str, use_one_hot_encoding: bool)¶

Bases: object

Stores the properties of a data set to be used for training and evaluating multi-label classifiers.

class mlrl.testbed.data_splitting.DataSplit¶

Bases: ABC

Provides information about a split of the available data that is used for training and testing.

abstract get_fold() → int | None¶

Returns the cross validation fold, this split corresponds to.

Returns:: The cross validation fold, starting at 0, or None, if no cross validation is used

abstract get_num_folds() → int¶

Returns the total number of cross validation folds.

Returns:: The total number of cross validation folds or 1, if no cross validation is used

is_cross_validation_used() → bool¶

Returns whether cross validation is used or not.

Returns:: True, if cross validation is used, False otherwise

abstract is_last_fold() → bool¶

Returns whether this split corresponds to the last fold of a cross validation or not.

Returns:: True, if this split corresponds to the last fold, False otherwise

abstract is_train_test_separated() → bool¶

Returns whether the training data is separated from the test data or not.

Returns:: True, if the training data is separated from the test data, False otherwise

class mlrl.testbed.data_splitting.DataSplitter¶

Bases: ABC

An abstract base class for all classes that split a data set into training and test data.

class Callback¶

Bases: ABC

An abstract base class for all classes that train and evaluate a model given a predefined split of the available data.

abstract train_and_evaluate(meta_data: MetaData, data_split: DataSplit, train_x, train_y, test_x, test_y)¶

The function that is invoked to train a model on a training set and evaluate it on a test set.

Parameters:

meta_data – The meta-data of the training data set
data_split – Information about the split of the available data that should be used for training and evaluating the model
train_x – The feature matrix of the training examples
train_y – The label matrix of the training examples
test_x – The feature matrix of the test examples
test_y – The label matrix of the test examples

run(callback: Callback)¶

Parameters:: callback – The callback that should be used for training and evaluating models

class mlrl.testbed.data_splitting.DataType(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)¶

Bases: Enum

Characterizes data as either training or test data.

TEST = 'test'¶

TRAINING = 'training'¶

get_file_name(name: str) → str¶

Returns a file name that corresponds to a specific type of data.

Parameters:: name – The name of the file (without suffix)
Returns:: The file name

class mlrl.testbed.data_splitting.NoSplit¶

Bases: DataSplit

Provides information about data that has not been split into separate training and test data.

get_fold() → int | None¶

Returns the cross validation fold, this split corresponds to.

Returns:: The cross validation fold, starting at 0, or None, if no cross validation is used

get_num_folds() → int¶

Returns the total number of cross validation folds.

Returns:: The total number of cross validation folds or 1, if no cross validation is used

is_last_fold() → bool¶

Returns whether this split corresponds to the last fold of a cross validation or not.

Returns:: True, if this split corresponds to the last fold, False otherwise

is_train_test_separated() → bool¶

Returns whether the training data is separated from the test data or not.

Returns:: True, if the training data is separated from the test data, False otherwise

class mlrl.testbed.data_splitting.NoSplitter(data_set: DataSet)¶

Bases: DataSplitter

Does not split the available data into separate train and test sets.

class mlrl.testbed.data_splitting.TrainTestSplitter(data_set: DataSet, test_size: float, random_state: int)¶

Bases: DataSplitter

Splits the available data into a single train and test set.

class mlrl.testbed.data_splitting.TrainingTestSplit¶

Bases: DataSplit

Provides information about a split of the available data into training and test data.

get_fold() → int | None¶

Returns the cross validation fold, this split corresponds to.

Returns:: The cross validation fold, starting at 0, or None, if no cross validation is used

get_num_folds() → int¶

Returns the total number of cross validation folds.

Returns:: The total number of cross validation folds or 1, if no cross validation is used

is_last_fold() → bool¶

Returns whether this split corresponds to the last fold of a cross validation or not.

Returns:: True, if this split corresponds to the last fold, False otherwise

is_train_test_separated() → bool¶

Returns whether the training data is separated from the test data or not.

Returns:: True, if the training data is separated from the test data, False otherwise

mlrl.testbed.data_splitting.check_if_files_exist(directory: str, file_names: List[str]) → bool¶

Returns whether all given files exist or not. If some of the files are missing, an IOError is raised.

Parameters:

directory – The path to the directory where the files should be located
file_names – A list that contains the names of all files to be checked

Returns:

True, if all files exist, False, if all files are missing