qcut_operator¶
Operator to convert a continuous variable to a categorical variable, useful for binning data
-
class
tasrif.processing_pipeline.pandas.qcut_operator.
QCutOperator
(cut_column_name, bin_column_name, quantile, **kwargs)¶ Quantile-based discretization function using Pandas
qcut
Examples
>>> import pandas as pd >>> import numpy as np >>> import datetime >>> >>> from tasrif.processing_pipeline.pandas import CutOperator >>> >>> >>> df = pd.DataFrame({ ... 'Time': pd.date_range('2018-01-01', '2018-01-10', freq='1H', closed='left'), ... 'Steps': np.random.randint(100,5000, size=9*24), ... } ... ) >>> >>> ids = [] >>> for i in range(1, 217): ... ids.append(i%10 + 1) >>> >>> df["Id"] = ids ### Input ### Time Steps Id 0 2018-01-01 00:00:00 4974 2 1 2018-01-01 01:00:00 3377 3 2 2018-01-01 02:00:00 293 4 3 2018-01-01 03:00:00 3389 5 4 2018-01-01 04:00:00 1906 6 ... ... ... ... 211 2018-01-09 19:00:00 4715 3 212 2018-01-09 20:00:00 1947 4 213 2018-01-09 21:00:00 2181 5 214 2018-01-09 22:00:00 2701 6 215 2018-01-09 23:00:00 3444 7
>>> # 4 Equally distributed bins >>> df1 = df.copy() >>> operator = QCutOperator(cut_column_name='Steps', ... bin_column_name='Bin', ... quantile=4, ... retbins=True) >>> df1, bins = operator.process(df1)[0] >>> print('Bins:', bins) >>> df1 ### Output 1 ### Bins: [ 100. 1341.5 2437.5 3502.25 4987. ] (99.999, 1341.5] 54 (1341.5, 2437.5] 54 (2437.5, 3502.25] 54 (3502.25, 4987.0] 54 Name: Bin, dtype: int64 Time Steps Id Bin 0 2018-01-01 00:00:00 1414 2 (1341.5, 2437.5] 1 2018-01-01 01:00:00 1513 3 (1341.5, 2437.5] 2 2018-01-01 02:00:00 937 4 (99.999, 1341.5] 3 2018-01-01 03:00:00 3551 5 (3502.25, 4987.0] 4 2018-01-01 04:00:00 2573 6 (2437.5, 3502.25] ... ... ... ... ... 211 2018-01-09 19:00:00 2835 3 (2437.5, 3502.25] 212 2018-01-09 20:00:00 409 4 (99.999, 1341.5] 213 2018-01-09 21:00:00 691 5 (99.999, 1341.5] 214 2018-01-09 22:00:00 1533 6 (1341.5, 2437.5] 215 2018-01-09 23:00:00 3018 7 (2437.5, 3502.25]
>>> # Custom bins >>> cut_labels = ['Sedentary', "Light", 'Moderate', 'Vigorous'] >>> quantiles = [0, 0.2, 0.5, 0.80, 1] >>> >>> df2 = df.copy() >>> operator = QCutOperator(cut_column_name='Steps', ... bin_column_name='Bin', ... quantile=quantiles, ... labels=cut_labels) >>> df2 = operator.process(df1)[0] >>> print(df2['Bin'].value_counts()) >>> df2 ### Output 2 ### Moderate 65 Light 64 Sedentary 44 Vigorous 43 Name: Bin, dtype: int64 ... Time Steps Id Bin 0 2018-01-01 00:00:00 1414 2 Light 1 2018-01-01 01:00:00 1513 3 Light 2 2018-01-01 02:00:00 937 4 Sedentary 3 2018-01-01 03:00:00 3551 5 Moderate 4 2018-01-01 04:00:00 2573 6 Moderate ... ... ... ... ... 211 2018-01-09 19:00:00 2835 3 Moderate 212 2018-01-09 20:00:00 409 4 Sedentary 213 2018-01-09 21:00:00 691 5 Sedentary 214 2018-01-09 22:00:00 1533 6 Light 215 2018-01-09 23:00:00 3018 7 Moderate
-
__init__
(cut_column_name, bin_column_name, quantile, **kwargs)¶ Initializes the operator
- Parameters
cut_column_name (str) – Name of the column to perform the cut operation on
bin_column_name (str) – Name of the column representing the bins
quantile (int or list-like of float) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
**kwargs – key word arguments passed to pandas
cut
method
-