statistics_operator
Operator to aggregate column features based on a column
class tasrif.processing_pipeline.custom.statistics_operator.StatisticsOperator(participant_identifier='Id', date_feature_name='Date', filter_features=None)
Computes statistics for each feature of a 2D time series dataframe and returns the computed statistics as a dataframe.
Examples
>>> import pandas as pd
>>> from tasrif.processing_pipeline.custom import StatisticsOperator
>>> df = pd.DataFrame([
...     ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
...     ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
...     ['2020-02-20', 0, 1600, 2], ['2020-02-21', 4000, 2000, 2], ['2020-02-22', 11000, 2400, 2],
...     ['2020-02-20', None, 2000, 3], ['2020-02-21', 0, 2700, 3], ['2020-02-22', 15000, 3100, 3]],
...     columns=['Day', 'Steps', 'Calories', 'PersonId'])
>>>
>>> filter_features = {
...     'Steps': lambda x : x > 0
... }
>>> sop = StatisticsOperator(participant_identifier='PersonId',
...     date_feature_name='Day', filter_features=filter_features)
>>> sop.process(df)
[                   statistic         Day       Steps    Calories    PersonId
 0                  row_count          12           9          12          12
 1         missing_data_count           0           1           0           0
 2       duplicate_rows_count           3           3           3           3
 3          participant_count           3           3           3           3
 4                   min_date  2020-02-20  2020-02-20  2020-02-20  2020-02-20
 5                   max_date  2020-02-22  2020-02-22  2020-02-22  2020-02-22
 6                   duration           2           2           2           2
 7  mean_days_per_participant           4           3           4           3
 8  mean_participants_per_day           3           3           4           4]
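To make some of the statistic names above concrete, the following is a minimal sketch in plain pandas. It does not use tasrif and is not the operator's implementation. The definitions chosen here (for example, duration as the difference in days between the latest and earliest date) are assumptions, and the `summary` dictionary is a hypothetical name used only for illustration.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame(
    [
        ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
        ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
        ['2020-02-20', 0, 1600, 2], ['2020-02-21', 4000, 2000, 2], ['2020-02-22', 11000, 2400, 2],
        ['2020-02-20', None, 2000, 3], ['2020-02-21', 0, 2700, 3], ['2020-02-22', 15000, 3100, 3],
    ],
    columns=['Day', 'Steps', 'Calories', 'PersonId'],
)

days = pd.to_datetime(df['Day'])

# Assumed definitions, computed over the whole frame or the 'Steps' column.
summary = {
    'row_count': len(df),                                   # all rows, no filtering
    'missing_data_count': int(df['Steps'].isna().sum()),    # None becomes NaN in 'Steps'
    'duplicate_rows_count': int(df.duplicated().sum()),     # repeats of an earlier row
    'participant_count': int(df['PersonId'].nunique()),
    'min_date': days.min().date().isoformat(),
    'max_date': days.max().date().isoformat(),
    'duration': (days.max() - days.min()).days,             # days between first and last date
}

for name, value in summary.items():
    print(f'{name}: {value}')
```

Under these assumptions the unfiltered row count is 12. The operator reports 9 for Steps in the example above because the filter `x > 0` is applied to that column first.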
__init__(participant_identifier='Id', date_feature_name='Date', filter_features=None)
Creates a new instance of StatisticsOperator
Parameters
participant_identifier (str) – Name of the feature identifying the participant
date_feature_name (str) – Name of the feature identifying the date
filter_features (dict) – Dictionary mapping a column/feature name to a (lambda) function that provides a selection clause. If a column or feature name is omitted, a default selection of non-zero or non-empty values is applied.
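The per-column selection described for filter_features can be pictured with the sketch below. It is written in plain pandas and is not taken from tasrif. The helpers `default_filter` and `count_selected` are hypothetical names, and reading "non-zero or non-empty" as "present and not equal to zero" is an assumption.

```python
import pandas as pd


def default_filter(values):
    """Assumed default selection: keep values that are present and non-zero."""
    return values.notna() & (values != 0)


def count_selected(df, filter_features=None):
    """Count, per column, the rows kept by that column's selection clause.

    Columns without an entry in filter_features fall back to default_filter.
    """
    filter_features = filter_features or {}
    counts = {}
    for column in df.columns:
        selector = filter_features.get(column, default_filter)
        mask = selector(df[column])
        counts[column] = int(mask.sum())
    return counts


# Small illustrative frame (hypothetical data).
frame = pd.DataFrame({
    'Steps': [1000, 0, None, 5000],
    'Calories': [1800, 0, 2000, 2100],
})

# 'Steps' uses an explicit clause; 'Calories' uses the assumed default.
counts = count_selected(frame, {'Steps': lambda x: x > 0})
print(counts)
```

Here the explicit clause keeps two Steps values (1000 and 5000), while the assumed default keeps the three non-zero Calories values.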