statistics_operator

Operator that computes summary statistics for each column (feature) of a dataframe

class tasrif.processing_pipeline.custom.statistics_operator.StatisticsOperator(participant_identifier='Id', date_feature_name='Date', filter_features=None)

Computes statistics for each feature of a 2D time-series dataframe and returns the computed statistics as a dataframe.

Examples

>>> import pandas as pd
>>> from tasrif.processing_pipeline.custom import StatisticsOperator
>>> df = pd.DataFrame([
...     ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
...     ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
...     ['2020-02-20', 0, 1600, 2], ['2020-02-21', 4000, 2000, 2], ['2020-02-22', 11000, 2400, 2],
...     ['2020-02-20', None, 2000, 3], ['2020-02-21', 0, 2700, 3], ['2020-02-22', 15000, 3100, 3]],
...     columns=['Day', 'Steps', 'Calories', 'PersonId'])
>>>
>>> filter_features = {
...     'Steps': lambda x : x > 0
... }
>>> sop = StatisticsOperator(participant_identifier='PersonId',
...                          date_feature_name='Day', filter_features=filter_features)
>>> sop.process(df)
[                   statistic         Day       Steps    Calories    PersonId
0                  row_count          12           9          12          12
1         missing_data_count           0           1           0           0
2       duplicate_rows_count           3           3           3           3
3          participant_count           3           3           3           3
4                   min_date  2020-02-20  2020-02-20  2020-02-20  2020-02-20
5                   max_date  2020-02-22  2020-02-22  2020-02-22  2020-02-22
6                   duration           2           2           2           2
7  mean_days_per_participant           4           3           4           3
8  mean_participants_per_day           3           3           4           4]
__init__(participant_identifier='Id', date_feature_name='Date', filter_features=None)

Creates a new instance of StatisticsOperator

Parameters
  • participant_identifier (str) – Name of the feature identifying the participant

  • date_feature_name (str) – Name of the feature identifying the date

  • filter_features (dict) – Dictionary mapping a column/feature name to a function (typically a lambda) that provides the selection clause for that column. If a column or feature name is omitted, a default selection of non-zero or non-empty values is applied.
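
The effect of a selection clause on a per-column row count can be illustrated with plain pandas. This is a minimal sketch, not the operator's actual implementation: the helper name count_selected_rows and the sample data are hypothetical, and the default for columns without a clause is simplified here to "non-empty" (non-null) values.

```python
import pandas as pd


def count_selected_rows(df, filter_features):
    """Hypothetical helper: count, per column, the values that pass its selection clause."""
    counts = {}
    for column in df.columns:
        # Drop missing values first so the clause never sees None/NaN.
        values = df[column].dropna()
        clause = filter_features.get(column)
        if clause is None:
            # Assumption for this sketch: with no clause, every non-empty value counts.
            counts[column] = int(len(values))
        else:
            # Apply the user-supplied clause element-wise and count the True results.
            counts[column] = int(values.map(clause).sum())
    return counts


# Illustrative data only: one zero and one missing value in 'Steps'.
sample = pd.DataFrame(
    {"Steps": [1000, 0, None, 4000], "Calories": [1800, 1600, 2000, 2000]}
)
counts = count_selected_rows(sample, {"Steps": lambda x: x > 0})
print(counts)
```

With the clause lambda x: x > 0, the zero and the missing entry in Steps are excluded from its count, while Calories, which has no clause, keeps all of its non-empty values.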