mlrl.testbed.data module

Author: Michael Rapp (michael.rapp.ml@gmail.com)

Provides functions for loading and saving data sets.

class mlrl.testbed.data.Feature(name: str, feature_type: FeatureType, nominal_values: List[str] | None = None)

Bases: object

Represents a numerical or nominal feature that is contained in a data set.

class mlrl.testbed.data.FeatureType(value)

Bases: Enum

All supported types of features.

NOMINAL = 3
NUMERICAL = 1
ORDINAL = 2

class mlrl.testbed.data.MetaData(features: List[Feature], outputs: List[Output], outputs_at_start: bool)

Bases: object

Stores the meta-data of a data set.

get_feature_indices(feature_types: Set[FeatureType] | None = None) → List[int]

Returns a list that contains the indices, in ascending order, of all features whose type is one of a given set of types.

Parameters:

feature_types – A set that contains the types of the features whose indices should be returned, or None if all indices should be returned

Returns:

A list that contains the indices of all features of the given types

get_num_features(feature_types: Set[FeatureType] | None = None) → int

Returns the number of features whose type is one of a given set of types.

Parameters:

feature_types – A set that contains the types of the features to be counted, or None if all features should be counted

Returns:

The number of features of the given types
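
The filtering performed by get_feature_indices and get_num_features can be illustrated with a minimal sketch. The classes and functions below are simplified stand-ins written for this example and are not the library's implementation; only the enum members and their values are taken from the documentation above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class FeatureType(Enum):
    # Same members and values as documented above.
    NUMERICAL = 1
    ORDINAL = 2
    NOMINAL = 3


@dataclass
class Feature:
    # Simplified stand-in for mlrl.testbed.data.Feature.
    name: str
    feature_type: FeatureType


def get_feature_indices(features: List[Feature],
                        feature_types: Optional[Set[FeatureType]] = None) -> List[int]:
    # None selects every feature; otherwise only the requested types are kept.
    # Enumerating in order yields the indices in ascending order.
    return [
        index for index, feature in enumerate(features)
        if feature_types is None or feature.feature_type in feature_types
    ]


def get_num_features(features: List[Feature],
                     feature_types: Optional[Set[FeatureType]] = None) -> int:
    # Counting is the same filter, reduced to a length.
    return len(get_feature_indices(features, feature_types))


features = [
    Feature('age', FeatureType.NUMERICAL),
    Feature('color', FeatureType.NOMINAL),
    Feature('height', FeatureType.NUMERICAL),
]
nominal_indices = get_feature_indices(features, {FeatureType.NOMINAL})
numerical_count = get_num_features(features, {FeatureType.NUMERICAL})
```

Here nominal_indices holds the position of the single nominal feature, and numerical_count counts the two numerical ones.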

class mlrl.testbed.data.Output(name: str)

Bases: Feature

Represents an output that is contained in a data set.

mlrl.testbed.data.load_data_set(data_dir: str, arff_file_name: str, meta_data: MetaData, feature_dtype=numpy.float32, output_dtype=numpy.uint8) → Tuple[lil_array, lil_array]

Loads a data set from an ARFF file given its meta-data.

Parameters:
  • data_dir – The path of the directory that contains the ARFF file

  • arff_file_name – The name of the ARFF file (including the suffix)

  • meta_data – The meta-data

  • feature_dtype – The requested data type of the feature matrix

  • output_dtype – The requested data type of the output matrix

Returns:

A scipy.sparse.lil_array of type feature_dtype, shape (num_examples, num_features), representing the feature values of the examples, as well as a scipy.sparse.lil_array of type output_dtype, shape (num_examples, num_outputs), representing the corresponding ground truth
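
How the columns of an ARFF file might be divided into a feature matrix and an output matrix can be sketched as follows. This is an illustration only: it assumes that outputs_at_start means the output columns precede the feature columns in the file, which the documentation above does not state explicitly. The helper split_columns is hypothetical and works on plain lists rather than sparse arrays.

```python
from typing import List, Sequence, Tuple


def split_columns(rows: Sequence[Sequence[float]], num_outputs: int,
                  outputs_at_start: bool) -> Tuple[List[List[float]], List[List[float]]]:
    x, y = [], []
    for row in rows:
        if outputs_at_start:
            # Assumption: the first num_outputs columns are the outputs.
            y.append(list(row[:num_outputs]))
            x.append(list(row[num_outputs:]))
        else:
            # Otherwise the last num_outputs columns are the outputs.
            split = len(row) - num_outputs
            x.append(list(row[:split]))
            y.append(list(row[split:]))
    return x, y


rows = [[0.5, 1.2, 1, 0], [0.1, 3.4, 0, 1]]
x, y = split_columns(rows, num_outputs=2, outputs_at_start=False)
```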

mlrl.testbed.data.load_data_set_and_meta_data(data_dir: str, arff_file_name: str, xml_file_name: str, feature_dtype=numpy.float32, output_dtype=numpy.uint8) → Tuple[lil_array, lil_array, MetaData]

Loads a data set from an ARFF file and the corresponding Mulan XML file.

Parameters:
  • data_dir – The path of the directory that contains the files

  • arff_file_name – The name of the ARFF file (including the suffix)

  • xml_file_name – The name of the XML file (including the suffix)

  • feature_dtype – The requested data type of the feature matrix

  • output_dtype – The requested data type of the output matrix

Returns:

A scipy.sparse.lil_array of type feature_dtype, shape (num_examples, num_features), representing the feature values of the examples, a scipy.sparse.lil_array of type output_dtype, shape (num_examples, num_outputs), representing the corresponding ground truth, as well as the data set’s meta-data
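
The Mulan XML file names the attributes of the ARFF file that are outputs rather than features. The sketch below illustrates that relationship using only the standard library. It assumes the usual Mulan layout (a labels root element in the http://mulan.sourceforge.net/labels namespace with one label element per output) and a simplified ARFF header without quoted attribute names. The helper split_attributes is hypothetical and not part of mlrl.testbed.data.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

MULAN_NS = '{http://mulan.sourceforge.net/labels}'

XML_TEXT = """<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="sports"/>
  <label name="politics"/>
</labels>
"""

ARFF_HEADER = """@relation news
@attribute length numeric
@attribute source {web,print}
@attribute sports {0,1}
@attribute politics {0,1}
@data
"""


def split_attributes(arff_header: str, xml_text: str) -> Tuple[List[str], List[str]]:
    # Names of the outputs, as declared in the XML file.
    root = ET.fromstring(xml_text)
    declared = {label.get('name') for label in root.iter(MULAN_NS + 'label')}

    feature_names, output_names = [], []
    for line in arff_header.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].lower() == '@attribute':
            name = parts[1]
            # An attribute is an output if the XML file lists its name.
            (output_names if name in declared else feature_names).append(name)
    return feature_names, output_names


feature_names, output_names = split_attributes(ARFF_HEADER, XML_TEXT)
```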

mlrl.testbed.data.one_hot_encode(x, y, meta_data: MetaData, encoder=None)

One-hot encodes the nominal features contained in a data set, if any.

If the given feature matrix is sparse, it is converted into a dense matrix. In addition, an updated variant of the given meta-data, from which the features have been removed, is returned, because the original features become invalid once one-hot encoding has been applied.

Parameters:
  • x – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_features), representing the features of the examples in the data set

  • y – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_outputs), representing the outputs of the examples in the data set

  • meta_data – The meta-data of the data set

  • encoder – The ColumnTransformer to be used, or None if a new encoder should be created

Returns:

A np.ndarray, shape (num_examples, num_encoded_features), representing the encoded features of the given examples, the encoder that has been used, as well as the updated meta-data
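
The effect of one-hot encoding on a feature matrix can be shown with a small, dependency-free sketch. According to the parameter list above, the library function works with a ColumnTransformer; the code below does not use it and is not the library's implementation. The helper one_hot_encode_rows is hypothetical, and placing the encoded columns before the untouched numerical columns is a choice made for this example.

```python
from typing import List, Sequence, Tuple


def one_hot_encode_rows(rows: Sequence[Sequence],
                        nominal_indices: Sequence[int]) -> Tuple[List[List[float]], List[List]]:
    nominal = list(nominal_indices)
    # Collect the sorted distinct values of each nominal column.
    categories = [sorted({row[i] for row in rows}) for i in nominal]
    encoded = []
    for row in rows:
        new_row = []
        for i, values in zip(nominal, categories):
            # One binary column per distinct value of the nominal feature.
            new_row.extend(1.0 if row[i] == value else 0.0 for value in values)
        # The remaining columns are passed through unchanged.
        new_row.extend(float(v) for j, v in enumerate(row) if j not in nominal)
        encoded.append(new_row)
    return encoded, categories


rows = [[1.5, 'red'], [2.0, 'blue'], [0.5, 'red']]
encoded, categories = one_hot_encode_rows(rows, nominal_indices=[1])
```

The single nominal column with two distinct values becomes two binary columns, so the encoded matrix has three columns instead of two. This change in the number and meaning of columns is why the original feature descriptions no longer apply.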

mlrl.testbed.data.save_arff_file(output_dir: str, arff_file_name: str, x: ndarray, y: ndarray, meta_data: MetaData)

Saves a data set to an ARFF file.

Parameters:
  • output_dir – The path of the directory where the ARFF file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • x – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_features), that stores the features of the examples that are contained in the data set

  • y – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_outputs), that stores the outputs of the examples that are contained in the data set

  • meta_data – The meta-data of the data set that should be saved

mlrl.testbed.data.save_data_set(output_dir: str, arff_file_name: str, x: ndarray, y: ndarray) → MetaData

Saves a data set to an ARFF file. All features in the data set are considered to be numerical.

Parameters:
  • output_dir – The path of the directory where the ARFF file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • x – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_features), that stores the features of the examples that are contained in the data set

  • y – A np.ndarray, scipy.sparse.spmatrix or scipy.sparse.sparray, shape (num_examples, num_outputs), that stores the outputs of the examples that are contained in the data set

Returns:

The meta-data of the data set that has been saved
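
What an ARFF file with exclusively numerical features looks like can be sketched as follows. This is an illustration of the file format, not the library's writer: the helper to_arff, the generated attribute names and the declaration of outputs as {0,1} attributes are assumptions made for this example.

```python
from typing import Sequence


def to_arff(relation: str, x: Sequence[Sequence[float]], y: Sequence[Sequence[int]]) -> str:
    num_features = len(x[0])
    num_outputs = len(y[0])
    lines = ['@relation ' + relation]
    # Every feature is declared as numeric; the names are generated for this example.
    lines += ['@attribute feature{} numeric'.format(i) for i in range(num_features)]
    # Assumption: outputs are binary and declared as nominal {0,1} attributes.
    lines += ['@attribute output{} {{0,1}}'.format(i) for i in range(num_outputs)]
    lines.append('@data')
    for features, outputs in zip(x, y):
        lines.append(','.join(str(v) for v in list(features) + list(outputs)))
    return '\n'.join(lines) + '\n'


arff_text = to_arff('example', [[0.5, 1.2], [0.1, 3.4]], [[1, 0], [0, 1]])
```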

mlrl.testbed.data.save_data_set_and_meta_data(output_dir: str, arff_file_name: str, xml_file_name: str, x: ndarray, y: ndarray) → MetaData

Saves a data set to an ARFF file and its meta-data to an XML file. All features in the data set are considered to be numerical.

Parameters:
  • output_dir – The path of the directory where the ARFF file and the XML file should be saved

  • arff_file_name – The name of the ARFF file (including the suffix)

  • xml_file_name – The name of the XML file (including the suffix)

  • x – An array of type float, shape (num_examples, num_features), representing the features of the examples that are contained in the data set

  • y – An array of type float, shape (num_examples, num_outputs), representing the ground truth of the examples that are contained in the data set

Returns:

The meta-data of the data set that has been saved

mlrl.testbed.data.save_meta_data(output_dir: str, xml_file_name: str, meta_data: MetaData)

Saves the meta-data of a data set to an XML file.

Parameters:
  • output_dir – The path of the directory where the XML file should be saved

  • xml_file_name – The name of the XML file (including the suffix)

  • meta_data – The meta-data of the data set
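
For orientation, the following sketch produces an XML document in the Mulan label format with the standard library. It assumes the common Mulan layout (a labels root element in the http://mulan.sourceforge.net/labels namespace containing one label element per output) and is not the library's implementation. The helper to_mulan_xml is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

MULAN_NAMESPACE = 'http://mulan.sourceforge.net/labels'


def to_mulan_xml(output_names: Sequence[str]) -> str:
    # The namespace is written as a plain xmlns attribute on the root element.
    root = ET.Element('labels', {'xmlns': MULAN_NAMESPACE})
    for name in output_names:
        # One label element per output of the data set.
        ET.SubElement(root, 'label', {'name': name})
    return ET.tostring(root, encoding='unicode')


xml_text = to_mulan_xml(['sports', 'politics'])
```

Parsing xml_text again yields label elements in the Mulan namespace, in the same order as the names passed in.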