Built-In Operators

Tasrif has many built-in operators that suit most eHealth processing workflows.

Currently, built-in Operators can be classified into three groups:

  • Pandas

  • Kats

  • TSFresh

  • Custom

  • Observers

Pandas Operators, as the name suggests, are Operators that are built on top of the Pandas library. These Operators are mostly derived from the Pandas API, and have been used to enrich Tasrif with commonly used operations on Pandas DataFrames.Examples include the DropNaOperator, ReadCsvOperator and ConvertToDatetimeOperator, all of which are derived from their Pandas counterparts.

Kats Operators are built on top of Facebook’s Kats library for time series analysis. Currently, the only Operator present is the CalculateTimeSeriesOperator. This operator is useful to extract useful features , such as, seasonality strength, entropy (how predictable is a time-series), and more.

Tsfresh Operators are built on top of TSFresh library. Currently, the only Operator present is the TSFreshFeatureExtractorOperator. The operator extracts time-series features based on scalable hypothesis tests. The default features returned from the operators are

>>> TSFRESH_FEATURES = {'agg_linear_trend': [{'attr': 'slope', 'chunk_len': 50, 'f_agg': 'mean'},
...                                          {'attr': 'slope', 'chunk_len': 10, 'f_agg': 'var'},
...                                          {'attr': 'slope', 'chunk_len': 5, 'f_agg': 'max'},
...                                          {'attr': 'slope', 'chunk_len': 5, 'f_agg': 'mean'},
...                                          {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'max'},
...                                          {'attr': 'slope', 'chunk_len': 50, 'f_agg': 'var'},
...                                          {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'mean'},
...                                          {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'var'},
...                                          {'attr': 'slope', 'chunk_len': 10, 'f_agg': 'mean'},
...                                          {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'mean'},
...                                          {'attr': 'slope', 'chunk_len': 50, 'f_agg': 'max'},
...                                          {'attr': 'slope', 'chunk_len': 5, 'f_agg': 'var'},
...                                          {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'var'},
...                                          {'attr': 'slope', 'chunk_len': 10, 'f_agg': 'max'},
...                                          {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'var'},
...                                          {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'max'},
...                                          {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'max'},
...                                          {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'mean'},
...                                          {'attr': 'intercept', 'chunk_len': 10, 'f_agg': 'mean'},
...                                          {'attr': 'intercept', 'chunk_len': 10, 'f_agg': 'var'},
...                                          {'attr': 'intercept', 'chunk_len': 10, 'f_agg': 'max'},
...                                          {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'max'}],
...                      'linear_trend': [{'attr': 'rvalue'},
...                                       {'attr': 'slope'},
...                                       {'attr': 'intercept'}],
...                      'index_mass_quantile': [{'q': 0.4},
...                                              {'q': 0.7},
...                                              {'q': 0.6},
...                                              {'q': 0.8},
...                                              {'q': 0.3}],
...                      'cwt_coefficients': [{'coeff': 3, 'w': 2, 'widths': (2, 5, 10, 20)},
...                                           {'coeff': 7, 'w': 2, 'widths': (2, 5, 10, 20)}],
...                      'last_location_of_maximum': None,
...                      'fft_coefficient': [{'attr': 'imag', 'coeff': 1},
...                                          {'attr': 'imag', 'coeff': 8}],
...                      'first_location_of_maximum': None,
...                      'energy_ratio_by_chunks': [{'num_segments': 10,
...                                                  'segment_focus': 9}]}

Future operators may include one to extract relevant features from the time-series.

Custom Operators have custom processing functions built by the Tasrif team. Examples include:

  • AddDurationOperator, for computing the duration between events in time series data.

  • CreateFeatureOperator, for adding new columns to DataFrames.

  • StatisticsOperator, for computing statistics such as row count and N/A counts for DataFrames.

Observers allows the user to see the output of intermediate operators in a `SequenceOperator. A user may do the following with Observers:

  • See the “head”, “tail” or “info” of the output of an operator using tasrif.processing_pipeline.observers.Logger

  • See a dataframe after grouping using tasrif.processing_pipeline.observers.GroupbyLogger

  • See “distribution”, “correlation”, “diff” and “missing” using tasrif.processing_pipeline.observers.DataprepObserver

  • Plot data per day per id in a dataframe using tasrif.processing_pipeline.observers.VisualizeDaysObserver

Observer example:

import pandas as pd
import numpy as np
from tasrif.processing_pipeline.pandas import RenameOperator
from tasrif.processing_pipeline.observers import DataprepObserver

# Prepare two days data
two_days = 48*2
idx = pd.date_range("2018-01-01", periods=two_days, freq="30T", name='startTime')
activity = np.random.randint(0, 100, two_days)
df = pd.DataFrame(data=activity, index=idx, columns=['activity'])
df['steps'] = np.random.randint(100, 1000, two_days)
df['sleep'] = False

# reduce activity between 23:30 - 08:00
time_filter = df.between_time(start_time='23:30', end_time='8:00').index
df.loc[time_filter, 'sleep'] = True
df.loc[time_filter, 'activity'] = df.loc[time_filter, 'activity'] / 100
df.loc[time_filter, 'steps'] = 0
df = RenameOperator(columns={"logId": "id"}, observers=[DataprepObserver(method='distribution,missing')]).process(df)
df