participation_overview_operator
Operator that aggregates feature columns by a grouping column (participant or date).
class tasrif.processing_pipeline.custom.participation_overview_operator.ParticipationOverviewOperator(participant_identifier='Id', date_feature_name='Date', overview_type='participant_vs_features', filter_features=None)
Creates a dataframe showing the overview of a dataframe representing data collected from people over several days.
Specifically, two types of overviews are generated:
participant_vs_features: This overview creates a dataframe in which each row identifies a participant and each column identifies a feature. Each cell holds the number of days for which data is available (or was measured) for that participant.
date_vs_features: This overview creates a dataframe in which each row identifies a date and each column identifies a feature. Each cell holds the number of participants for which data is available (or was measured) on that day.
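The counting behind the two overviews can be illustrated with a minimal plain-Python sketch. This is not the operator's implementation: the helper `overview` is hypothetical, the "data is available" test is assumed to mean "not None, not zero, not empty", and the data is assumed to contain one row per participant per day, so that counting rows is the same as counting days or participants.

```python
from collections import defaultdict

def overview(rows, group_key, features):
    """Count, per group, the rows in which each feature holds usable data.

    Hypothetical helper for illustration only; it assumes one row per
    participant per day.
    """
    counts = defaultdict(lambda: {f: 0 for f in features})
    for row in rows:
        group = counts[row[group_key]]
        for f in features:
            value = row[f]
            # Assumed availability test: not None, not zero, not empty.
            if value is not None and value != 0 and value != "":
                group[f] += 1
    return {key: dict(per_feature) for key, per_feature in counts.items()}

columns = ["Day", "Steps", "Calories", "PersonId"]
data = [
    ["2020-02-20", 1000, 1800, 1], ["2020-02-21", 5000, 2100, 1], ["2020-02-22", 10000, 2400, 1],
    ["2020-02-20", 0, 1600, 2], ["2020-02-21", 4000, 2000, 2], ["2020-02-22", 11000, 2400, 2],
    ["2020-02-20", 500, 2000, 3], ["2020-02-21", 0, 2700, 3], ["2020-02-22", 15000, 3100, 3],
]
rows = [dict(zip(columns, record)) for record in data]

# participant_vs_features: group by participant, count days with data.
by_participant = overview(rows, "PersonId", ["Steps", "Calories"])
# date_vs_features: group by date, count participants with data.
by_date = overview(rows, "Day", ["Steps", "Calories"])
```

The only difference between the two overview types in this sketch is the grouping key: the participant identifier in one case, the date in the other.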
Examples
>>> import pandas as pd
>>> from tasrif.processing_pipeline.custom import ParticipationOverviewOperator
>>> df = pd.DataFrame( [
...     ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
...     ['2020-02-20', 0, 1600, 2], ['2020-02-21', 4000, 2000, 2], ['2020-02-22', 11000, 2400, 2],
...     ['2020-02-20', 500, 2000, 3], ['2020-02-21', 0, 2700, 3], ['2020-02-22', 15000, 3100, 3]],
...     columns=['Day', 'Steps', 'Calories', 'PersonId'])
>>>
>>> op = ParticipationOverviewOperator(participant_identifier='PersonId', date_feature_name='Day')
>>> df1 = op.process(df)
>>> df1
[   PersonId  Count  Steps  Calories
0         1      3      3         3
1         2      3      2         3
2         3      3      2         3]
>>> op2 = ParticipationOverviewOperator(participant_identifier='PersonId',
...                                     date_feature_name='Day', overview_type='date_vs_features')
>>> df2 = op2.process(df)
>>> df2
[          Day  Steps  Calories  Count
0  2020-02-20      2         3      3
1  2020-02-21      2         3      3
2  2020-02-22      3         3      3]
>>> # Count only days where the number of steps > 1000
>>> od = {
...     'Steps': lambda x: x > 1000
... }
>>> op3 = ParticipationOverviewOperator(participant_identifier='PersonId',
...                                     date_feature_name='Day', filter_features=od)
>>> df3 = op3.process(df)
>>> df3
[   PersonId  Count  Steps  Calories
0         1      3      2         3
1         2      3      2         3
2         3      3      1         3]
>>> # Count only participants whose number of steps > 1000 on each day
>>> op4 = ParticipationOverviewOperator(participant_identifier='PersonId',
...                                     date_feature_name='Day',
...                                     overview_type='date_vs_features', filter_features=od)
>>> df4 = op4.process(df)
>>> df4
[          Day  Steps  Calories  Count
0  2020-02-20      0         3      3
1  2020-02-21      2         3      3
2  2020-02-22      3         3      3]
__init__(participant_identifier='Id', date_feature_name='Date', overview_type='participant_vs_features', filter_features=None)
Creates a new instance of ParticipationOverviewOperator.
Parameters
participant_identifier (str) – Name of the feature identifying the participant
date_feature_name (str) – Name of the feature identifying the date
overview_type (str) – Type of overview; one of participant_vs_features or date_vs_features
filter_features (dict) – Dictionary mapping a column/feature name to a function (for example a lambda) that provides the selection clause for that column. If a column or feature name is omitted, a default selection of non-zero or non-empty values is applied to it.
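The fallback described for filter_features can be sketched in plain Python as a dictionary lookup with a default. The names `default_selection` and `count_selected` are hypothetical, and the default is assumed to reject None, zero, and empty values. The sketch applies the selection function to one value at a time, whereas the operator may well apply it to a whole column.

```python
def default_selection(value):
    # Assumed default: keep values that are not None, not zero, and not empty.
    return value is not None and value != 0 and value != ""

def count_selected(values, column, filter_features=None):
    """Count the values of `column` accepted by its selection function.

    Hypothetical helper: uses the function registered for `column` in
    `filter_features`, or the assumed default when the column is omitted.
    """
    select = (filter_features or {}).get(column, default_selection)
    return sum(1 for v in values if select(v))

od = {"Steps": lambda x: x > 1000}

# Values of participant 3 from the Examples section.
steps = [500, 0, 15000]
calories = [2000, 2700, 3100]

steps_filtered = count_selected(steps, "Steps", od)           # custom clause
calories_default = count_selected(calories, "Calories", od)   # omitted, so default
steps_default = count_selected(steps, "Steps")                # no filters at all
```

Because 'Calories' has no entry in `od`, it falls back to the default selection, which is why only the Steps counts change between the filtered and unfiltered examples.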