mlrl.testbed.data module

Author: Michael Rapp (michael.rapp.ml@gmail.com)

Provides functions for handling multi-label data.

class mlrl.testbed.data.Attribute(attribute_name: str, attribute_type: AttributeType, nominal_values: List[str] | None = None)

Bases: object

Represents a numerical or nominal attribute that is contained by a data set.

class mlrl.testbed.data.AttributeType(value)

Bases: Enum

All supported types of attributes.

NOMINAL = 3
NUMERICAL = 1
ORDINAL = 2

class mlrl.testbed.data.Label(name: str)

Bases: Attribute

Represents a label that is contained by a data set.

class mlrl.testbed.data.MetaData(attributes: List[Attribute], labels: List[Attribute], labels_at_start: bool)

Bases: object

Stores the meta-data of a multi-label data set.

get_attribute_indices(attribute_types: Set[AttributeType] | None = None) -> List[int]

Returns a list that contains the indices, in ascending order, of all attributes whose type is contained in a given set of types.

Parameters:

attribute_types – A set that contains the types of the attributes whose indices should be returned or None, if all indices should be returned

Returns:

A list that contains the indices of all attributes of the given types

get_num_attributes(attribute_types: Set[AttributeType] | None = None) -> int

Returns the number of attributes whose type is contained in a given set of types.

Parameters:

attribute_types – A set that contains the types of the attributes to be counted or None, if all attributes should be counted

Returns:

The number of attributes of the given types
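The filtering behaviour described for get_attribute_indices and get_num_attributes can be illustrated with a minimal sketch. It does not import mlrl.testbed; the Attribute stand-in and the attribute_indices helper below are hypothetical simplifications written for this example, and only the enum values are taken from the listing above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class AttributeType(Enum):
    # Values copied from the AttributeType listing in this reference
    NUMERICAL = 1
    ORDINAL = 2
    NOMINAL = 3


@dataclass
class Attribute:
    # Hypothetical, simplified stand-in for mlrl.testbed.data.Attribute
    attribute_name: str
    attribute_type: AttributeType


def attribute_indices(attributes: List[Attribute],
                      attribute_types: Optional[Set[AttributeType]] = None) -> List[int]:
    # None means "no filter", so every index is returned. Because enumerate
    # walks the list from front to back, the result is in ascending order.
    return [
        index for index, attribute in enumerate(attributes)
        if attribute_types is None or attribute.attribute_type in attribute_types
    ]


attributes = [
    Attribute("age", AttributeType.NUMERICAL),
    Attribute("color", AttributeType.NOMINAL),
    Attribute("height", AttributeType.NUMERICAL),
    Attribute("grade", AttributeType.ORDINAL),
]

nominal_indices = attribute_indices(attributes, {AttributeType.NOMINAL})
numerical_indices = attribute_indices(attributes, {AttributeType.NUMERICAL})
all_indices = attribute_indices(attributes)
# Counting is simply the length of the filtered index list
num_nominal_or_ordinal = len(
    attribute_indices(attributes, {AttributeType.NOMINAL, AttributeType.ORDINAL}))
```

With the real classes, the same questions would presumably be asked through MetaData.get_attribute_indices and MetaData.get_num_attributes.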

mlrl.testbed.data.load_data_set(data_dir: str, arff_file_name: str, meta_data: MetaData, feature_dtype=numpy.float32, label_dtype=numpy.uint8) -> Tuple[lil_matrix, lil_matrix]

Loads a multi-label data set from an ARFF file given its meta-data.

Parameters:
  • data_dir – The path of the directory that contains the ARFF file

  • arff_file_name – The name of the ARFF file (including the suffix)

  • meta_data – The meta-data of the data set

  • feature_dtype – The requested data type of the feature matrix

  • label_dtype – The requested data type of the label matrix

Returns:

A scipy.sparse.lil_matrix of type feature_dtype, shape (num_examples, num_features), representing the feature values of the examples, as well as a scipy.sparse.lil_matrix of type label_dtype, shape (num_examples, num_labels), representing the corresponding label vectors

mlrl.testbed.data.load_data_set_and_meta_data(data_dir: str, arff_file_name: str, xml_file_name: str, feature_dtype=numpy.float32, label_dtype=numpy.uint8) -> Tuple[lil_matrix, lil_matrix, MetaData]

Loads a multi-label data set from an ARFF file and the corresponding Mulan XML file.

Parameters:
  • data_dir – The path of the directory that contains the files

  • arff_file_name – The name of the ARFF file (including the suffix)

  • xml_file_name – The name of the XML file (including the suffix)

  • feature_dtype – The requested data type of the feature matrix

  • label_dtype – The requested data type of the label matrix

Returns:

A scipy.sparse.lil_matrix of type feature_dtype, shape (num_examples, num_features), representing the feature values of the examples, a scipy.sparse.lil_matrix of type label_dtype, shape (num_examples, num_labels), representing the corresponding label vectors, as well as the data set’s meta-data
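To make the role of the Mulan XML file more concrete, the following sketch extracts label names from such a file using only the standard library. It is an illustration, not the library's loader: the namespace URI, the example document, and the parse_label_names helper are assumptions based on the conventional layout of Mulan label files.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed namespace of Mulan label files (not stated in this reference)
MULAN_NAMESPACE = "http://mulan.sourceforge.net/labels"

# Hypothetical example document that declares two labels
EXAMPLE_XML = """<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="beach"></label>
  <label name="sunset"></label>
</labels>
"""


def parse_label_names(xml_text: str) -> List[str]:
    root = ET.fromstring(xml_text)
    # iter() visits all descendants in document order, so nested <label>
    # elements would be included as well
    label_tag = "{" + MULAN_NAMESPACE + "}label"
    return [element.get("name") for element in root.iter(label_tag)]


label_names = parse_label_names(EXAMPLE_XML)
```

Knowing the label names is what allows the columns of the ARFF file to be split into a feature matrix and a label matrix.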

mlrl.testbed.data.one_hot_encode(x, y, meta_data: MetaData, encoder=None)

One-hot encodes the nominal attributes contained in a data set, if any.

If the given feature matrix is sparse, it is converted into a dense matrix. In addition, an updated variant of the given meta-data, from which the attributes have been removed, is returned, because the original attributes become invalid once one-hot encoding has been applied.

Parameters:
  • x – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_features), representing the features of the examples in the data set

  • y – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_labels), representing the labels of the examples in the data set

  • meta_data – The meta-data of the data set

  • encoder – The ‘ColumnTransformer’ to be used or None, if a new encoder should be created

Returns:

A np.ndarray, shape (num_examples, num_encoded_features), representing the encoded features of the given examples, the encoder that has been used, as well as the updated meta-data
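The effect of one-hot encoding on a feature matrix can be sketched in plain Python. This is not the library's implementation and does not use a ColumnTransformer: the one_hot_encode_columns helper, its categories dictionary (which plays the role of the reusable encoder), and the column order of the result are all assumptions made for this illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Categories = Dict[int, List[float]]


def one_hot_encode_columns(rows: Sequence[Sequence[float]],
                           nominal_indices: Sequence[int],
                           categories: Optional[Categories] = None
                           ) -> Tuple[List[List[float]], Categories]:
    if categories is None:
        # "Fit" step: collect the sorted distinct values of each nominal column.
        # If categories are passed in, they are reused and nominal_indices is ignored.
        categories = {index: sorted({row[index] for row in rows})
                      for index in nominal_indices}
    encoded_rows = []
    for row in rows:
        encoded_row: List[float] = []
        for index, value in enumerate(row):
            if index in categories:
                # One indicator column per known category; a value that was
                # not seen during fitting yields all zeros
                encoded_row.extend(1.0 if value == category else 0.0
                                   for category in categories[index])
            else:
                # Columns that are not nominal are copied unchanged
                encoded_row.append(value)
        encoded_rows.append(encoded_row)
    return encoded_rows, categories


x = [
    [0.5, 0.0, 10.0],
    [1.5, 2.0, 20.0],
    [2.5, 1.0, 30.0],
]
# Column 1 is treated as nominal and expands into three indicator columns
encoded_x, categories = one_hot_encode_columns(x, nominal_indices=[1])

# Reusing the fitted categories keeps the encoding consistent for new examples
encoded_new, _ = one_hot_encode_columns([[9.0, 2.0, 1.0]], [1], categories)
```

Returning the fitted state alongside the encoded matrix mirrors the idea of passing the returned encoder back in, for example to encode test data in the same way as training data.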

mlrl.testbed.data.save_arff_file(output_dir: str, arff_file_name: str, x: ndarray, y: ndarray, meta_data: MetaData)

Saves a multi-label data set to an ARFF file.

Parameters:
  • output_dir – The path of the directory where the ARFF file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • x – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_features), that stores the features of the examples that are contained in the data set

  • y – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_labels), that stores the labels of the examples that are contained in the data set

  • meta_data – The meta-data of the data set that should be saved

mlrl.testbed.data.save_data_set(output_dir: str, arff_file_name: str, x: ndarray, y: ndarray) -> MetaData

Saves a multi-label data set to an ARFF file. All attributes in the data set are considered to be numerical.

Parameters:
  • output_dir – The path of the directory where the ARFF file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • x – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_features), that stores the features of the examples that are contained in the data set

  • y – A np.ndarray or scipy.sparse matrix, shape (num_examples, num_labels), that stores the labels of the examples that are contained in the data set

Returns:

The meta-data of the data set that has been saved
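As a rough illustration of what treating every attribute as numerical means in ARFF terms, the sketch below renders a small dense data set as ARFF text. It is not the library's writer: the render_numeric_arff helper, the relation name, the attribute names X0, y0, and so on, and the placement of labels after the features are hypothetical, and the library may declare label columns differently.

```python
from typing import Sequence


def render_numeric_arff(relation: str,
                        x: Sequence[Sequence[float]],
                        y: Sequence[Sequence[int]]) -> str:
    num_features = len(x[0])
    num_labels = len(y[0])
    lines = ["@relation " + relation, ""]
    # Hypothetical attribute names; every column is declared as numeric
    lines += ["@attribute X{} numeric".format(i) for i in range(num_features)]
    lines += ["@attribute y{} numeric".format(i) for i in range(num_labels)]
    lines += ["", "@data"]
    for features, labels in zip(x, y):
        # One comma-separated row per example: features first, then labels
        lines.append(",".join(str(value) for value in list(features) + list(labels)))
    return "\n".join(lines) + "\n"


arff_text = render_numeric_arff("example", [[0.5, 1.0], [2.0, 3.5]], [[1, 0], [0, 1]])
```

Because no nominal value lists are needed, the meta-data for such a file can be derived from the shapes of x and y alone.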

mlrl.testbed.data.save_data_set_and_meta_data(output_dir: str, arff_file_name: str, xml_file_name: str, x: ndarray, y: ndarray) -> MetaData

Saves a multi-label data set to an ARFF file and its meta-data to an XML file. All attributes in the data set are considered to be numerical.

Parameters:
  • output_dir – The path of the directory where the ARFF file and the XML file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • xml_file_name – The name of the XML file (including the suffix)

  • x – An array of type float, shape (num_examples, num_features), representing the features of the examples that are contained in the data set

  • y – An array of type float, shape (num_examples, num_labels), representing the label vectors of the examples that are contained in the data set

Returns:

The meta-data of the data set that has been saved

mlrl.testbed.data.save_meta_data(output_dir: str, xml_file_name: str, meta_data: MetaData)

Saves the meta-data of a multi-label data set to an XML file.

Parameters:
  • output_dir – The path of the directory where the XML file should be saved

  • xml_file_name – The name of the XML file (including the suffix)

  • meta_data – The meta-data of the data set