drop_duplicates_operator

Remove duplicate values from one or more dataframes.

class tasrif.processing_pipeline.pandas.drop_duplicates_operator.DropDuplicatesOperator(**kwargs)

Remove duplicate rows from one or more dataframes.

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> from tasrif.processing_pipeline.pandas import DropDuplicatesOperator
>>>
>>> df0 = pd.DataFrame([['Tom', 10], ['Alfred', 15], ['Alfred', 18], ['Juli', 14]], columns=['name', 'score'])
>>> df1 = pd.DataFrame({"name": ['Alfred', 'juli', 'Tom', 'Ali'],
...                   "height": [np.nan, 155, 159, 165],
...                   "born": [pd.NaT, pd.Timestamp("2010-04-25"), pd.NaT,
...                            pd.NaT]})
>>>
>>> operator = DropDuplicatesOperator(subset='name')
>>> df0, df1 = operator.process(df0, df1)
>>>
>>> print(df0)
>>> print(df1)
     name  score
0     Tom     10
1  Alfred     15
3    Juli     14
     name  height       born
0  Alfred     NaN        NaT
1    juli   155.0 2010-04-25
2     Tom   159.0        NaT
3     Ali   165.0        NaT
__init__(**kwargs)

Initializes the operator

Parameters

**kwargs – key word arguments passed to pandas DataFrame.drop_duplicates method