one_hot_encoder

Operator to encode (transform) categorical features as a one-hot numeric array.

class tasrif.processing_pipeline.custom.one_hot_encoder.OneHotEncoderOperator(feature_names: list, drop_first: bool = True, separator: str = ',')

Encodes categorical column features from existing features in the data frame. This operator works on a 2D data frames where the columns represent the features. A feature column with X different values will be replaced by X (or X-1) new column features.

Examples

>>> import pandas as pd
>>> from tasrif.processing_pipeline.custom import OneHotEncoderOperator
>>>
>>> df = pd.DataFrame({'id': [1, 2, 3], 'colors': ['red', 'white', 'blue'],
>>>             'cities': ['Doha', 'Vienna', 'Belo Horizonte'],
>>>             'multiple': ["1,2", "1", "1,3"]
>>> df
      id  colors  cities          multiple
  0       1       red         Doha                1,2
  1       2       white   Vienna          1
  2       3       blue    Belo Horizonte  1,3
>>> OneHotEncoderOperator(feature_names=["colors"], drop_first=False).process(df)[0]
          id      cities          multiple        colors=red      colors=white
  0       1       Doha            1,2             1               0
  1       2       Vienna          1               0               1
  2       3       Belo Horizonte  1,3             0               0
>>> OneHotEncoderOperator(feature_names=["colors"], drop_last_expansion=True).process(df)[0]
      id  cities          multiple        colors=blue     colors=red      colors=white
  0       1       Doha            1,2             0               1               0
  1       2       Vienna          1               0               0               1
  2       3       Belo Horizonte  1,3             1               0               0
>>> OneHotEncoderOperator(feature_names=["colors", "multiple"], drop_first=False).process(df)[0]
      id  cities          colors=blue     colors=red      colors=white    multiple=1      multiple=2      multiple=3
  0       1       Doha            0               1               0                   1           1               0
  1       2       Vienna          0               0               1                   1           0               0
  2       3       Belo Horizonte  1               0               0                   1           0               1
__init__(feature_names: list, drop_first: bool = True, separator: str = ',')

Creates a new instance of OneHotEncoderOperator

Parameters
  • feature_names (list) – The list of categorical features that will be one hot encoded

  • drop_first (bool) – Transforming a category of X values into exactly X columns most of the times is redundant. If the values are NOT multiple choice, any of the new columns can be removed without resulting in loss of information, at least in a ML perspective. Default: False

  • separator (str) – That is the separator, if any, used when the values in a column are represented as multiple choice. Default: ‘,’