Operators

The basic unit of processing in Tasrif is the Operator. This is the mechanism through which complex functionality is neatly packaged for use in processing workflows.

Operators take pandas DataFrames as input and return pandas DataFrames as output. A single Operator can also process several DataFrames in one call.

As an example, consider the DropNAOperator that can be used to drop rows with missing values in input DataFrames:

>>> import pandas as pd
>>> from tasrif.processing_pipeline.pandas import DropNAOperator

>>> df1 = pd.DataFrame({
...     'Date':   ['05-06-2021', '06-06-2021', '07-06-2021', '08-06-2021'],
...     'Steps':  [        4500,         None,         5690,         6780]
... })

>>> df2 = pd.DataFrame({
...     'Date':   ['12-07-2021', '13-07-2021', '14-07-2021', '15-07-2021'],
...     'Steps':  [        2100,         None,         None,         5400]
... })

>>> operator = DropNAOperator()
>>> dfs = operator.process(df1, df2)

>>> dfs[0]
        Date   Steps
0  05-06-2021  4500.0
2  07-06-2021  5690.0
3  08-06-2021  6780.0

>>> dfs[1]
        Date   Steps
0  12-07-2021  2100.0
3  15-07-2021  5400.0

To use an Operator, we first instantiate it with its appropriate parameters. Because the DropNAOperator is built on top of pandas.DataFrame.dropna, it accepts the same parameters when it is instantiated.
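To illustrate the kind of parameters involved, the sketch below calls pandas.DataFrame.dropna directly, using plain pandas rather than Tasrif. If DropNAOperator forwards its arguments to dropna as described, the same keyword arguments (for example subset or how) would be given when instantiating the Operator; that forwarding is taken from the description above and is not verified here. The sample data is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  ['05-06-2021', '06-06-2021', None, '08-06-2021'],
    'Steps': [4500, None, 5690, 6780],
})

# Default behaviour: drop a row if any of its columns is missing.
any_missing = df.dropna()

# Only look at the 'Steps' column when deciding which rows to drop.
by_steps = df.dropna(subset=['Steps'])

# Drop a row only if every column in it is missing.
all_missing = df.dropna(how='all')

print(any_missing)
print(by_steps)
print(all_missing)
```

With this data, the default call keeps only the rows where both columns are present, subset=['Steps'] keeps the row whose Date is missing, and how='all' keeps every row because no row is entirely empty.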

Next, we call the .process method on the newly created Operator, passing in the input DataFrames. The Operator drops every row that contains a missing value from both DataFrames and returns the results in a list, in the same order as the inputs.
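The instantiate-then-process pattern can be sketched with a small stand-in class. This is only an illustration of the idea, not Tasrif's actual implementation: the class name DropNASketchOperator and its internals are hypothetical, and it assumes nothing about Tasrif beyond what is described above (parameters at construction, DataFrames passed to process, a list returned).

```python
import pandas as pd


class DropNASketchOperator:
    """Hypothetical stand-in for illustration; not Tasrif's actual class."""

    def __init__(self, **kwargs):
        # Keep the keyword arguments so they can be forwarded to dropna later.
        self.kwargs = kwargs

    def process(self, *data_frames):
        # Apply dropna to each input and return the results in a list,
        # in the same order as the inputs.
        return [df.dropna(**self.kwargs) for df in data_frames]


df1 = pd.DataFrame({'Steps': [4500, None, 5690, 6780]})
df2 = pd.DataFrame({'Steps': [2100, None, None, 5400]})

results = DropNASketchOperator().process(df1, df2)

print(results[0])
print(results[1])
```

Keeping the configuration in the constructor and the data in process means one configured object can be reused on any number of DataFrames.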

Operators are more useful when combined together to execute a processing workflow. In the next section, we will see how to chain together multiple Operators to form a processing Pipeline.