clintk.cat2vec.lasso_gridsearch module

The objective of this script is to select the most informative categories of a high-cardinality categorical feature using LASSO (L1) penalization.

For the moment, only logistic regression with a binary or continuous target is implemented.


clintk.cat2vec.lasso_gridsearch.lr_coefficients(path, features, targets, key, output_path, **kwargs)

Performs categorical variable selection using an L1-penalized logistic regression model.

Only binary or continuous targets are supported for the moment.

Parameters:
  • path (str) – input path or url for the dataframe
  • features (str) – column name of the categorical column
  • targets (str) – name of the target column in the df
  • key (str) – key to group categorical variables
  • output_path (str) – path to save the coefficients in a csv file
  • kwargs – keyword arguments for the hyperparameter grid
Returns:
  • array – the coefficients of the L1-penalized logistic regression
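The description above can be illustrated with a minimal sketch. This is not the clintk implementation: the helper name `l1_category_coefficients`, the use of scikit-learn and `pandas.crosstab`, the 3-fold cross-validation, and the toy data are all assumptions made for illustration. It builds one row per `key` with an indicator column per category, grid-searches an L1-penalized logistic regression, and returns one coefficient per category.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def l1_category_coefficients(df, feature, target, key, **param_grid):
    """Hypothetical sketch of L1-based category selection (not clintk code)."""
    # One row per key, one 0/1 indicator column per category value.
    X = (pd.crosstab(df[key], df[feature]) > 0).astype(int)
    # Assumption: the target is constant within a key, so take its first value.
    y = df.groupby(key)[target].first().loc[X.index]
    grid = GridSearchCV(
        # liblinear supports the L1 penalty; the grid may override the solver.
        LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
        param_grid=param_grid,
        cv=3,
    )
    grid.fit(X.values, y.values)
    coefs = grid.best_estimator_.coef_.ravel()
    return pd.Series(coefs, index=X.columns, name="coefficient")


# Toy data (invented): drug_a marks label 1, drug_b marks label 0,
# drug_c / drug_d are assigned at random and carry no signal.
rng = np.random.default_rng(0)
rows = []
for patient in range(60):
    label = patient % 2
    meds = ["drug_a" if label == 1 else "drug_b"]
    meds.append(str(rng.choice(["drug_c", "drug_d"])))
    rows += [
        {"patient_id": patient, "medication_name": m, "target": label}
        for m in meds
    ]
df = pd.DataFrame(rows)

coefs = l1_category_coefficients(
    df, "medication_name", "target", "patient_id",
    solver=["liblinear"], C=[0.1, 1.0, 10.0],
)
print(coefs)
```

Categories whose coefficient is shrunk to exactly zero can be treated as not selected; the L1 penalty is what makes such exact zeros possible.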

Examples

>>> import numpy as np
>>> lr_coefficients('input.csv', 'medication_name', 'target',
...                 solver=['liblinear', 'saga'], C=np.logspace(-6, 2, 10))
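To see how many candidate models the keyword arguments in the example above describe, the following sketch expands them with scikit-learn's `ParameterGrid`. It assumes the keyword arguments are forwarded unchanged as a scikit-learn parameter grid; the source does not confirm this.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# The same keyword arguments as in the example, gathered into a dict.
param_grid = {
    "solver": ["liblinear", "saga"],
    "C": np.logspace(-6, 2, 10),  # 10 values from 1e-6 to 1e2
}

# Every (solver, C) combination is one candidate model.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
print(n_candidates)
```

Both `liblinear` and `saga` support the L1 penalty in scikit-learn, and smaller values of `C` mean stronger regularization, so fewer categories keep a nonzero coefficient.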