dft
Source code: sensai/data_transformation/dft.py
- class DataFrameTransformer
Bases: ABC, ToStringMixin
Base class for data frame transformers, i.e. objects which can transform one data frame into another (possibly applying the transformation to the original data frame, i.e. an in-place transformation). A data frame transformer may require being fitted using training data.
- get_name() -> str
- Returns:
the name of this transformer, which may be a default name if the name has not been set.
- to_feature_generator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)
- chain(*others: DataFrameTransformer) -> DataFrameTransformerChain
- get_column_change_tracker() -> DataFrameColumnChangeTracker
- class DFTFromFeatureGenerator(fgen: FeatureGenerator, append: bool = False, copy: bool = True)
Bases: DataFrameTransformer
Transforms a feature generator into a data frame transformer, which either returns the features data frame or the original data frame extended with the features data frame.
- Parameters:
fgen – the feature generator from which to generate features
append – whether to append the columns generated by the feature generator to the input data frame; if False, the transformed data frame will consist only of the generated features data frame
copy – whether, for the case where append=True, the returned data frame shall copy the data of the input data frame (rather than reuse the data)
- class InvertibleDataFrameTransformer
Bases: DataFrameTransformer, ABC
- get_inverse() -> InverseDataFrameTransformer
- Returns:
a transformer whose (forward) transformation is the inverse transformation of this DFT
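The relationship between an invertible transformer and its inverse can be illustrated with a small, dependency-free sketch. The classes `MinMaxSketch` and `InverseSketch` below are hypothetical stand-ins that operate on plain lists; they are not sensai classes and only mirror the idea that `get_inverse()` returns an object whose forward transformation undoes the original one.

```python
from typing import List


class MinMaxSketch:
    """Hypothetical invertible transformer that scales values to [0, 1]."""

    def fit(self, values: List[float]) -> "MinMaxSketch":
        # Learn the range from training data (assumes max > min).
        self.lo, self.hi = min(values), max(values)
        return self

    def apply(self, values: List[float]) -> List[float]:
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]

    def apply_inverse(self, values: List[float]) -> List[float]:
        span = self.hi - self.lo
        return [v * span + self.lo for v in values]

    def get_inverse(self) -> "InverseSketch":
        return InverseSketch(self)


class InverseSketch:
    """Hypothetical wrapper whose forward step is the wrapped object's inverse."""

    def __init__(self, invertible: MinMaxSketch):
        self._invertible = invertible

    def apply(self, values: List[float]) -> List[float]:
        return self._invertible.apply_inverse(values)


scaler = MinMaxSketch().fit([10.0, 20.0, 30.0])
scaled = scaler.apply([10.0, 20.0, 30.0])
restored = scaler.get_inverse().apply(scaled)
```

In this sketch, applying the inverse to the scaled values recovers the original inputs.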
- class RuleBasedDataFrameTransformer
Bases: DataFrameTransformer, ABC
Base class for transformers whose logic is entirely based on rules and does not need to be fitted to data
- class InverseDataFrameTransformer(invertible_dft: InvertibleDataFrameTransformer)
- class DataFrameTransformerChain(*data_frame_transformers: Union[DataFrameTransformer, List[DataFrameTransformer]])
Bases: DataFrameTransformer
Supports the application of a chain of data frame transformers. During fit and apply, each transformer in the chain receives the transformed output of its predecessor.
- find_first_transformer_by_type(cls) -> Optional[DataFrameTransformer]
- append(t: DataFrameTransformer)
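The chaining behaviour described above (each transformer is fitted on, and applied to, the output of its predecessor) can be sketched without any dependencies. All names below (`AddColumnSum`, `CentreColumn`, `ChainSketch`, and the dict-of-lists `Table` type) are hypothetical and chosen for illustration only; the sketch does not reproduce sensai's implementation.

```python
from typing import Dict, List

Table = Dict[str, List[float]]  # column name -> column values


class AddColumnSum:
    """Hypothetical rule-based step: adds column 'total' = a + b."""

    def fit(self, table: Table) -> None:
        pass  # nothing to learn

    def apply(self, table: Table) -> Table:
        out = dict(table)  # shallow copy, so the input is not modified
        out["total"] = [a + b for a, b in zip(table["a"], table["b"])]
        return out


class CentreColumn:
    """Hypothetical fitted step: subtracts the mean learned during fit."""

    def __init__(self, column: str):
        self.column = column
        self.mean = 0.0

    def fit(self, table: Table) -> None:
        values = table[self.column]
        self.mean = sum(values) / len(values)

    def apply(self, table: Table) -> Table:
        out = dict(table)
        out[self.column] = [v - self.mean for v in table[self.column]]
        return out


class ChainSketch:
    def __init__(self, *steps):
        self.steps = list(steps)

    def fit(self, table: Table) -> None:
        for step in self.steps:
            step.fit(table)            # fit on the predecessor's output
            table = step.apply(table)  # then pass the transformed table on

    def apply(self, table: Table) -> Table:
        for step in self.steps:
            table = step.apply(table)
        return table


chain = ChainSketch(AddColumnSum(), CentreColumn("total"))
train = {"a": [1.0, 2.0, 3.0], "b": [1.0, 1.0, 1.0]}
chain.fit(train)
result = chain.apply(train)
```

Note that `CentreColumn("total")` can only be fitted because the preceding step has already created the `total` column, which is the point of passing transformed output along the chain.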
- class DFTRenameColumns(columns_map: Dict[str, str])
Bases: RuleBasedDataFrameTransformer
- Parameters:
columns_map – dictionary mapping old column names to new names
- class DFTConditionalRowFilterOnColumn(column: str, condition: Callable[[Any], bool])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a boolean function to one of the columns and retaining only the rows for which the function returns True
- class DFTInSetComparisonRowFilterOnColumn(column: str, set_to_keep: Set)
Bases: RuleBasedDataFrameTransformer
Filters a data frame on the selected column and retains only the rows for which the value is in set_to_keep
- class DFTNotInSetComparisonRowFilterOnColumn(column: str, set_to_drop: Set)
Bases: RuleBasedDataFrameTransformer
Filters a data frame on the selected column and retains only the rows for which the value is not in set_to_drop
- class DFTVectorizedConditionalRowFilterOnColumn(column: str, vectorized_condition: Callable[[pandas.Series], Sequence[bool]])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a vectorized condition on the selected column and retaining only the rows for which it returns True
- class DFTRowFilter(condition: Callable[[Any], bool])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a condition function to each row and retaining only the rows for which it returns True
- class DFTModifyColumn(column: str, column_transform: Union[Callable, numpy.ufunc])
Bases: RuleBasedDataFrameTransformer
Modifies the column specified by 'column' using 'column_transform'
- Parameters:
column – the name of the column to be modified
column_transform – a function operating on single cells or a Numpy ufunc that applies to an entire Series
- class DFTModifyColumnVectorized(column: str, column_transform: Callable[[numpy.ndarray], Union[Sequence, pandas.Series, numpy.ndarray]])
Bases: RuleBasedDataFrameTransformer
Modifies the column specified by 'column' using 'column_transform'. This transformer can be used to utilise Numpy vectorisation for performance optimisation.
- Parameters:
column – the name of the column to be modified
column_transform – a function that takes a Numpy array and whose return value will be assigned to the column as a whole
- class DFTOneHotEncoder(columns: Optional[Union[Sequence[str], str]], categories: Optional[Union[List[numpy.ndarray], Dict[str, numpy.ndarray]]] = None, inplace=False, ignore_unknown=False, array_valued_result=False)
Bases: DataFrameTransformer
One-hot encodes categorical variables
- Parameters:
columns – list of names, or regex matching names, of columns that are each to be replaced by a list of one-hot encoded columns (or by an array-valued column for the case where array_valued_result=True); if None, no columns are one-hot encoded
categories – numpy arrays containing the possible values of each of the specified columns (for the case where a sequence is specified in 'columns') or a dictionary mapping column name to the array of possible categories for that column. If None, the possible values will be inferred from the columns
inplace – whether to perform the transformation in-place
ignore_unknown – if True and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. If False, an unknown category will raise an error.
array_valued_result – whether to replace the input columns by columns of the same name containing arrays as values instead of creating a separate column per original value
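The effect of `ignore_unknown` can be sketched on a single column using plain Python. The function `one_hot_sketch` is a hypothetical helper written for this illustration and is not part of sensai; it only mirrors the documented behaviour (all zeros for an unknown category when `ignore_unknown=True`, an error otherwise). The choice of `ValueError` is an assumption of the sketch.

```python
from typing import List, Sequence


def one_hot_sketch(values: Sequence[str], categories: Sequence[str],
                   ignore_unknown: bool = False) -> List[List[int]]:
    """Hypothetical helper: one output row per value, one position per category."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        if value not in index and not ignore_unknown:
            # The concrete exception type is an assumption of this sketch.
            raise ValueError(f"unknown category: {value!r}")
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)  # unknown values keep the all-zero row
    return rows


# "green" is not among the known categories and is encoded as all zeros.
encoded = one_hot_sketch(["red", "blue", "green"], ["blue", "red"],
                         ignore_unknown=True)
```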
- class DFTColumnFilter(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)
Bases: RuleBasedDataFrameTransformer
A DataFrame transformer that filters columns by retaining or dropping specified columns
- class DFTKeepColumns(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)
Bases: DFTColumnFilter
- class DFTNormalisation(rules: Sequence[Rule], default_transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, require_all_handled: bool = True, inplace: bool = False)
Bases: DataFrameTransformer
Applies normalisation/scaling to a data frame by applying a set of transformation rules, where each rule defines a set of columns to which it applies (learning a single transformer based on the values of all applicable columns). DFTNormalisation ignores N/A values during fitting and application.
- Parameters:
rules – the set of rules; rules (i.e., their transformers) are always fitted and applied in the given order. A convenient way to obtain a set of rules in the sensai.vector_model.VectorModel context is from a sensai.featuregen.FeatureCollector or sensai.featuregen.MultiFeatureGenerator. Generally, it is often a good idea to associate rules (or a rule template) with a feature generator. Then the rules can be obtained from it using get_normalisation_rules.
default_transformer_factory – a factory for the creation of transformer instances (which implement the API used by sklearn.preprocessing, e.g. StandardScaler) that shall be used to create a transformer for all rules that do not specify a particular transformer. The default transformer will only be applied to columns matched by such rules; unmatched columns will not be transformed. Use SkLearnTransformerFactoryFactory to conveniently create a factory.
require_all_handled – whether to raise an exception if any column is not matched by a rule
inplace – whether to apply data frame transformations in-place
- class RuleTemplate(skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, independent_columns: Optional[bool] = None, array_valued: bool = False, fit: bool = True)
Bases: object
A template from which a rule that matches multiple columns can be created. This is useful for the generation of rules which shall apply to all the (numerical) columns generated by a FeatureGenerator without specifically naming them.
Use the parameters as follows:
- If the relevant features are already normalised, pass skip=True
- If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True
- If the relevant features shall be normalised, the other parameters apply. No parameters, i.e. RuleTemplate(), are an option if …
  - a default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features. Otherwise, specify either transformer_factory or transformer.
  - the resulting rule will match only a single column. Otherwise, independent_columns must be set to True or False.
- Parameters:
skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance's default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places, e.g. sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory is given, the containing DFTNormalisation instance's default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the rule must apply to at most one column.
fit – whether the rule's transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column; None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.
- class Rule(regex: Optional[Union[str, Pattern]], skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, array_valued: bool = False, fit: bool = True, independent_columns: Optional[bool] = None)
Bases: ToStringMixin
Use the parameters as follows:
- If the relevant features are already normalised, pass skip=True
- If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True
- If the relevant features shall be normalised, the other parameters apply. No parameters other than regex, i.e. Rule(regex), are an option if …
  - a default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features. Otherwise, specify either transformer_factory or transformer.
  - the resulting rule will match only a single column. Otherwise, independent_columns must be set to True or False.
- Parameters:
regex – a regular expression defining the column(s) the rule applies to. If it matches multiple columns, these columns will be normalised in the same way (using the same normalisation process for each column) unless independent_columns=True. If None, the rule is a placeholder rule and the regex must be set later via set_regex or the rule will not be applicable.
skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance's default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places, e.g. sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory is given, the containing DFTNormalisation instance's default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the regex must match at most one column.
fit – whether the rule's transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column; None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.
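The difference between `independent_columns=True` and `independent_columns=False` for a rule whose regex matches several columns can be sketched as follows. This is a dependency-free illustration, not sensai's implementation: `standardise_matching` and the dict-of-lists `Table` type are hypothetical, standardisation (zero mean, unit population standard deviation) stands in for whichever transformer a rule would use, and the use of `re.fullmatch` for column matching is an assumption of the sketch.

```python
import re
from statistics import mean, pstdev
from typing import Dict, List

Table = Dict[str, List[float]]  # column name -> column values


def standardise_matching(table: Table, regex: str,
                         independent_columns: bool) -> Table:
    """Hypothetical helper: standardises all columns whose name matches regex."""
    matched = [c for c in table if re.fullmatch(regex, c)]  # assumed matching mode
    out = dict(table)  # unmatched columns are passed through unchanged
    if independent_columns:
        # One (mean, std) pair is learned per matched column.
        for c in matched:
            m, s = mean(table[c]), pstdev(table[c])
            out[c] = [(v - m) / s for v in table[c]]
    else:
        # A single (mean, std) pair is learned from the values of all matched columns.
        pooled = [v for c in matched for v in table[c]]
        m, s = mean(pooled), pstdev(pooled)
        for c in matched:
            out[c] = [(v - m) / s for v in table[c]]
    return out


data = {"x_1": [0.0, 2.0], "x_2": [10.0, 12.0], "y": [5.0, 5.0]}
independent = standardise_matching(data, r"x_\d+", independent_columns=True)
joint = standardise_matching(data, r"x_\d+", independent_columns=False)
```

With independent columns, `x_1` and `x_2` each end up centred on zero; with a jointly learned transformation, `x_1` lies entirely below zero and `x_2` entirely above, because both are scaled relative to the pooled mean.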
- class DFTFromColumnGenerators(column_generators: Sequence[ColumnGenerator], inplace=False)
Bases: RuleBasedDataFrameTransformer
Extends a data frame with columns generated from ColumnGenerator instances
- class DFTCountEntries(column_for_entry_count: str, column_name_for_resulting_counts: str = 'counts')
Bases: RuleBasedDataFrameTransformer
Transforms a data frame, based on one of its columns, into a new data frame containing two columns that indicate the counts of unique values in the input column. It is the "DataFrame output version" of pd.Series.value_counts. Each row of the output data frame holds a unique value of the input column and the number of times it appears in the input column.
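The shape of the result (one column of unique values, one column of counts) can be sketched with the standard library alone. The function `count_entries_sketch` and its dict-of-lists output are hypothetical and used only for illustration; ordering the rows by descending count is a choice of this sketch, not a documented property of the transformer.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def count_entries_sketch(values: Sequence[Hashable], value_column: str,
                         counts_column: str = "counts") -> Dict[str, list]:
    """Hypothetical helper: returns a two-column table of unique values and counts."""
    ordered = Counter(values).most_common()  # (value, count) pairs, most frequent first
    return {
        value_column: [value for value, _ in ordered],
        counts_column: [count for _, count in ordered],
    }


result = count_entries_sketch(["a", "b", "a", "c", "a", "b"], "letter")
```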
- class DFTSkLearnTransformer(sklearn_transformer: SkLearnTransformerProtocol, columns: Optional[List[str]] = None, inplace=False, array_valued=False)
Bases: InvertibleDataFrameTransformer
Applies a transformer from sklearn.preprocessing to (a subset of) the columns of a data frame. If multiple columns are transformed, they are transformed independently (i.e. each column uses a separately trained transformation).
- Parameters:
sklearn_transformer – the transformer instance (from sklearn.preprocessing) to use (which will be fitted and applied)
columns – the set of column names to which the transformation shall apply; if None, apply it to all columns
inplace – whether to apply the transformation in-place
array_valued – whether to apply the transformation not to scalar-valued columns but to one or more array-valued columns, where the values of all arrays within a column (which may vary in length) are to be transformed in the same way. If multiple columns are transformed, then the arrays belonging to a single row must all have the same length.
- class DFTSortColumns
Bases: RuleBasedDataFrameTransformer
Sorts a data frame's columns in ascending order
- class DFTFillNA(fill_value, inplace: bool = False)
Bases: RuleBasedDataFrameTransformer
Fills NA/NaN values with the given value
- class DFTCastCategoricalColumns(columns: Optional[List[str]] = None, dtype=float)
Bases: RuleBasedDataFrameTransformer
Casts columns with dtype category to the given type. This can be useful in cases where categorical columns are not accepted by the model but the column values are actually numeric, in which case the cast to a numeric value yields an acceptable label encoding.
- Parameters:
columns – the columns to convert; if None, convert all that have dtype category
dtype – the data type to which categorical columns are to be converted
- class DFTDropNA(axis=0, inplace=False)
Bases: RuleBasedDataFrameTransformer
Drops rows or columns containing NA/NaN values
- Parameters:
axis – 0 to drop rows, 1 to drop columns containing an N/A value
inplace – whether to perform the operation in-place on the input data frame