clintk.utils.fold module

Because data may come from different sources, it is best to gather all the databases into a single dataframe that makes it easy to fetch the features, as well as the dates at which the events or measurements occurred.

Doing so makes it possible to retrieve the full timeline of each patient and therefore to carry out various tasks.

The objective of this module is to parse the available databases so that each one of them is organized as

key1 | key2 | feature_name | value | date
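
To illustrate why this common layout is convenient, here is a minimal sketch, assuming two already-parsed sources. The tables `labs` and `vitals`, their feature names, and their values are hypothetical and exist only for this illustration; only plain pandas calls are used, not clintk itself.

```python
import pandas as pd

# Two hypothetical sources, both already organized as
# key1 | key2 | feature_name | value | date (illustrative data only).
labs = pd.DataFrame({
    "key1": [1, 1, 2],
    "key2": ["a", "a", "b"],
    "feature_name": ["hemoglobin", "platelets", "hemoglobin"],
    "value": [13.2, 250.0, 11.8],
    "date": ["2012-12-12", "2012-12-12", "2012-12-14"],
})
vitals = pd.DataFrame({
    "key1": [1, 2],
    "key2": ["a", "b"],
    "feature_name": ["weight", "weight"],
    "value": [70.5, 82.0],
    "date": ["2012-12-10", "2012-12-15"],
})

# Because both sources share the same columns, they stack into one dataframe.
events = pd.concat([labs, vitals], ignore_index=True)
events["date"] = pd.to_datetime(events["date"])

# Full timeline of one patient: filter on the primary key, then sort by date.
timeline = events[events["key1"] == 1].sort_values("date")
print(timeline)
```

Once every source shares these five columns, adding a new database is a matter of appending rows rather than adding columns.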

class clintk.utils.fold.Folder(key1, key2, features, date, n_jobs=1)[source]

Bases: object

This object enables “unfolding” the features of a DataFrame: for a df that has, for instance, 5 feature columns, unfolding results in two columns, one holding the feature name and the other holding the feature value.

All the attributes are column names that indicate how to unfold the dataframe.
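
The same wide-to-long reshape can be sketched with `pandas.melt`. This is an illustration of the idea under stated assumptions, not necessarily how `Folder` is implemented: the helper name `unfold` is hypothetical, and the row order it produces (grouped by feature) may differ from what `Folder.fold` returns.

```python
import pandas as pd


def unfold(df, key1, key2, features, date):
    """Hypothetical helper (not part of clintk): wide-to-long reshape."""
    long_df = pd.melt(
        df,
        id_vars=[key1, key2, date],   # columns repeated on every output row
        value_vars=features,          # columns collapsed into name/value pairs
        var_name="feature",
        value_name="value",
    )
    # Reorder the columns to the key1 | key2 | feature | value | date layout.
    return long_df[[key1, key2, "feature", "value", date]]


wide = pd.DataFrame({
    "id1": [1, 2, 3],
    "id2": ["id1", "id2", "id3"],
    "feature_a": [0, 0.3, 1.4],
    "feature_b": [-1, 1, 0],
    "date": ["12122012", "12122012", "12122012"],
})

# 3 rows x 2 feature columns become 6 rows with one (feature, value) pair each.
long_df = unfold(wide, "id1", "id2", ["feature_a", "feature_b"], "date")
print(long_df)
```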

Parameters:
  • key1 (str) – name of the primary key column
  • key2 (str, optional?) – name of the secondary key column
  • features (list) – column names that contain the features
  • date (str) – name of the date column
  • n_jobs (int) – number of CPUs to use for computation. If -1, all the available cores are used

fold(df_base)[source]

Parameters: df_base (pandas.DataFrame)
Returns: a dataframe whose columns are [key1, key2, feature, value, date], where feature contains the feature names and value contains the corresponding values.
Return type: pandas.DataFrame

Examples

>>> import pandas as pd
>>> from clintk.utils import fold
>>> df = pd.DataFrame({'id1': [1, 2, 3], 'id2': ['id1', 'id2', 'id3'],
...                    'feature_a': [0, 0.3, 1.4],
...                    'date': ["12122012", "12122012","12122012"]})
>>> folder = fold.Folder('id1', 'id2', ['feature_a'], 'date')
>>> folded = folder.fold(df)
>>> print(folded)
   id1  id2    feature  value      date
0    1  id1  feature_a    0.0  12122012
1    2  id2  feature_a    0.3  12122012
2    3  id3  feature_a    1.4  12122012

For several features:

>>> df['feature_b'] = [-1, 1, 0]
>>> folder = fold.Folder('id1', 'id2', ['feature_a', 'feature_b'],
...                      'date')
>>> folded = folder.fold(df)
>>> print(folded)
   id1  id2    feature  value      date
0    1  id1  feature_a    0.0  12122012
1    1  id1  feature_b   -1.0  12122012
2    2  id2  feature_a    0.3  12122012
3    2  id2  feature_b    1.0  12122012
4    3  id3  feature_a    1.4  12122012
5    3  id3  feature_b    0.0  12122012