cut_operator
Operator to convert a continuous variable to a categorical variable, useful for binning data.

class tasrif.processing_pipeline.pandas.cut_operator.CutOperator(cut_column_name, bin_column_name, bins, **kwargs)
Bin values into discrete intervals using pandas cut.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> import datetime
>>> from tasrif.processing_pipeline.pandas import CutOperator
>>>
>>> # Note: on pandas 2.0 and later, date_range takes inclusive='left' instead of closed='left'.
>>> df = pd.DataFrame({
...     'Time': pd.date_range('2018-01-01', '2018-01-10', freq='1H', closed='left'),
...     'Steps': np.random.randint(100, 5000, size=9*24),
...     }
... )
>>>
>>> ids = []
>>> for i in range(1, 217):
...     ids.append(i%10 + 1)
>>>
>>> df["Id"] = ids
>>> df
### input ###
                   Time  Steps  Id
0   2018-01-01 00:00:00   1554   2
1   2018-01-01 01:00:00   1583   3
2   2018-01-01 02:00:00   1540   4
3   2018-01-01 03:00:00   4760   5
4   2018-01-01 04:00:00   1671   6
..                  ...    ...  ..
211 2018-01-09 19:00:00    298   3
212 2018-01-09 20:00:00   1059   4
213 2018-01-09 21:00:00    556   5
214 2018-01-09 22:00:00   3021   6
215 2018-01-09 23:00:00   4449   7
>>> # 4 equal-width bins
>>> df1 = df.copy()
>>> operator = CutOperator(cut_column_name='Steps',
...                        bin_column_name='Bin',
...                        bins=4,
...                        retbins=True)
>>>
>>> df1, bins = operator.process(df1)[0]
>>> print('Bins:', bins)
>>> df1
### output 1 ###
Bins: [ 147.178 1357.5 2563. 3768.5 4974. ]
                   Time  Steps  Id                Bin
0   2018-01-01 00:00:00   3911   2   (3768.5, 4974.0]
1   2018-01-01 01:00:00    360   3  (147.178, 1357.5]
2   2018-01-01 02:00:00   4466   4   (3768.5, 4974.0]
3   2018-01-01 03:00:00   1983   5   (1357.5, 2563.0]
4   2018-01-01 04:00:00   3059   6   (2563.0, 3768.5]
..                  ...    ...  ..                ...
211 2018-01-09 19:00:00   4387   3   (3768.5, 4974.0]
212 2018-01-09 20:00:00   1679   4   (1357.5, 2563.0]
213 2018-01-09 21:00:00   2445   5   (1357.5, 2563.0]
214 2018-01-09 22:00:00   2028   6   (1357.5, 2563.0]
215 2018-01-09 23:00:00    268   7  (147.178, 1357.5]
>>> # Custom bins
>>> cut_labels = ['Sedentary', 'Light', 'Moderate', 'Vigorous']
>>> cut_bins = [0, 500, 2000, 6000, float("inf")]
>>>
>>> df2 = df.copy()
>>> operator = CutOperator(cut_column_name='Steps',
...                        bin_column_name='Bin',
...                        bins=cut_bins,
...                        labels=cut_labels)
>>>
>>> df2 = operator.process(df2)[0]
>>> print(df2['Bin'].value_counts())
>>> df2
### output 2 ###
Moderate     135
Light         64
Sedentary     17
Vigorous       0
Name: Bin, dtype: int64
                   Time  Steps  Id        Bin
0   2018-01-01 00:00:00   3911   2   Moderate
1   2018-01-01 01:00:00    360   3  Sedentary
2   2018-01-01 02:00:00   4466   4   Moderate
3   2018-01-01 03:00:00   1983   5      Light
4   2018-01-01 04:00:00   3059   6   Moderate
..                  ...    ...  ..        ...
211 2018-01-09 19:00:00   4387   3   Moderate
212 2018-01-09 20:00:00   1679   4      Light
213 2018-01-09 21:00:00   2445   5   Moderate
214 2018-01-09 22:00:00   2028   6   Moderate
215 2018-01-09 23:00:00    268   7  Sedentary
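The custom-bin example can also be sketched in plain pandas, which may help when checking results without tasrif installed. This assumes the operator simply assigns the result of pandas cut to the bin column; that equivalence is an assumption, not something this page confirms. The step values are hypothetical and chosen so that each bin boundary is exercised.

```python
import pandas as pd

# Hypothetical step counts, picked to sit on and around the bin edges.
df = pd.DataFrame({"Steps": [120, 500, 501, 1999, 2000, 4500, 7000]})

cut_bins = [0, 500, 2000, 6000, float("inf")]
cut_labels = ["Sedentary", "Light", "Moderate", "Vigorous"]

# Assumed equivalent of CutOperator(cut_column_name="Steps",
# bin_column_name="Bin", bins=cut_bins, labels=cut_labels).
# Intervals are right-inclusive by default, so 500 falls in (0, 500].
df["Bin"] = pd.cut(df["Steps"], bins=cut_bins, labels=cut_labels)

print(df)
```

Because the default intervals are closed on the right, a value equal to an edge (500 or 2000 here) lands in the lower of the two neighbouring bins.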
__init__(cut_column_name, bin_column_name, bins, **kwargs)
Initializes the operator.
Parameters
cut_column_name (str) – Name of the column to perform the cut operation on
bin_column_name (str) – Name of the column representing the bins
bins (int, sequence of scalars, or IntervalIndex) – The criteria to bin by:
    int : Defines the number of equal-width bins in the range of the column being cut. The range is extended by 0.1% on each side to include the column's minimum and maximum values.
    sequence of scalars : Defines the bin edges, allowing for non-uniform width. The range is not extended.
    IntervalIndex : Defines the exact bins to be used. The IntervalIndex must be non-overlapping.
**kwargs – keyword arguments passed to the pandas cut method
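The three accepted forms of bins can be compared side by side with pandas cut itself. The sketch below calls pandas directly rather than CutOperator, on the assumption that the operator forwards bins and the keyword arguments unchanged. The step values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical step counts.
steps = pd.Series([100, 1000, 2400, 4900], name="Steps")

# int: four equal-width bins over the observed range.
# retbins=True also returns the computed bin edges.
by_count, edges = pd.cut(steps, bins=4, retbins=True)

# Sequence of scalars: explicit, possibly uneven edges; no range extension.
by_edges = pd.cut(steps, bins=[0, 500, 2000, 6000])

# IntervalIndex: exact, non-overlapping bins. Gaps are allowed, and values
# that fall in a gap (here 1000) become NaN.
intervals = pd.IntervalIndex.from_tuples([(0, 500), (2000, 6000)])
by_intervals = pd.cut(steps, bins=intervals)

print(edges)
print(pd.DataFrame({"Steps": steps, "int": by_count,
                    "sequence": by_edges, "IntervalIndex": by_intervals}))
```

With an integer, the lowest edge comes out slightly below the minimum value so that the minimum is included in the first right-closed interval. With explicit edges or an IntervalIndex, any value outside the given bins is left as NaN.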