clintk.cat2vec.neural_embedding module

Embedding high-cardinality categorical variables with distributed representations

The first embedder relies on the Word2Vec algorithm [1] to learn vector representations of words in a corpus.

[1] “Distributed Representations of Words and Phrases and their Compositionality”, Mikolov et al., Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.
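
As a brief, hedged illustration of that algorithm (independent of this package's API), gensim's Word2Vec learns one vector per word from a list of tokenized sentences; the corpus and dimensions below are toy values:

    from gensim.models import Word2Vec

    corpus = [['chest', 'pain', 'dyspnea'],
              ['chest', 'pain', 'fever'],
              ['fever', 'cough', 'dyspnea']]
    # pre-4.0 gensim keyword 'size'; newer versions use 'vector_size'
    model = Word2Vec(corpus, size=16, min_count=1, window=2, seed=0)
    print(model.wv['chest'])                        # 16-dimensional vector for 'chest'
    print(model.wv.most_similar('chest', topn=2))   # nearest words in the embedding space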

The second one is based on transfer learning: we train a fully connected neural network on a predictive task (only binary classification is supported for now) so that the upper layers learn higher-level representations of the categories. After training, the category vectors can be extracted from the embedding space.
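
For illustration, a minimal sketch of that idea in plain Keras follows (it is not the package's actual implementation): an MLP is trained on a binary target, and the activations of an upper hidden layer, named 'embedding' here, are reused as the category vectors. Layer sizes and data are purely illustrative.

    import numpy as np
    from keras.models import Sequential, Model
    from keras.layers import Dense, Dropout

    n_categories = 50                       # cardinality of the categorical variable
    # one-hot encoded observations and a random binary target (toy data)
    X = np.eye(n_categories)[np.random.randint(0, n_categories, size=500)]
    y = np.random.randint(0, 2, size=500)

    model = Sequential([
        Dense(64, activation='relu', input_dim=n_categories),
        Dropout(0.5),
        Dense(16, activation='relu', name='embedding'),  # layer to extract
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    model.fit(X, y, epochs=5, verbose=0)

    # after training, project each category into the learned 16-dimensional space
    embedder = Model(inputs=model.input,
                     outputs=model.get_layer('embedding').output)
    category_vectors = embedder.predict(np.eye(n_categories))  # shape (50, 16)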

class clintk.cat2vec.neural_embedding.NeuralEmbedder(input_dim, layers, activation='relu', output='sigmoid', optimizer='adam', loss='binary-crossentropy', dropout=0.5, metrics=['acc', 'mae'], epochs=20)[source]

Bases: sklearn.base.BaseEstimator

Trains an MLP classifier to learn a distributed representation of categories

Only available for binary targets

@TODO the optimizer argument should be able to receive a keras.Optimizer instance @TODO add batch_size and a validation set option?

input_dim : tuple, (int, int)
input_dim[0] is the number of units in the input layer; input_dim[1] is the dimension of the input (= number of features). See the construction sketch after this parameter list.
layers : tuple
The ith element represents the number of neurons in the ith hidden layer. Similar to sklearn’s MLP
activation : str, default=’relu’
activation function in the intermediate layers
output : str, default=’sigmoid’
output activation function, only supports sigmoid for binary classification
optimizer : str, default=’adam’
optimization algorithm used for backpropagation; see https://keras.io/optimizers for the available algorithms
loss : str, default=’binary-crossentropy’
loss function minimized during training; see https://keras.io/losses for the available losses
dropout : float, default=0.5
dropout rate
metrics : list, default=[‘acc’, ‘mae’]
metrics used during training and testing
epochs : int, default=20
number of epochs
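
A construction sketch using the parameters documented above; the values passed are illustrative, not recommended settings:

    from clintk.cat2vec.neural_embedding import NeuralEmbedder

    embedder = NeuralEmbedder(input_dim=(64, 200),  # 64 units in the input layer, 200 features
                              layers=(32, 16),      # two hidden layers
                              activation='relu',
                              output='sigmoid',
                              optimizer='adam',
                              loss='binary-crossentropy',
                              dropout=0.5,
                              metrics=['acc', 'mae'],
                              epochs=20)
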
fit(X, y)[source]

Trains the model on the input data

Parameters:
  • X (iterable) – feature matrix
  • y (iterable) – binary target vector
Returns:

record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable)

Return type:

keras.callbacks.History.history (dict)

transform(X)[source]

Transform X into a distributed representation learned by fit

Parameters:X (iterable) – feature matrix to embed
Returns:X projected into an embedding space
Return type:numpy array
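
A fit/transform sketch on synthetic data, assuming X is a one-hot feature matrix and y a binary target as documented above; shapes and layer sizes are illustrative:

    import numpy as np
    from clintk.cat2vec.neural_embedding import NeuralEmbedder

    X = np.eye(200)[np.random.randint(0, 200, size=1000)]  # 1000 samples, 200 one-hot features
    y = np.random.randint(0, 2, size=1000)                  # binary target

    embedder = NeuralEmbedder(input_dim=(64, 200), layers=(32, 16))
    history = embedder.fit(X, y)        # per-epoch losses and metrics, as documented above
    X_embedded = embedder.transform(X)  # X projected into the learned embedding space
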
class clintk.cat2vec.neural_embedding.W2VVectorizer(group_key, category_col, size=128, min_count=1, sg=1, window=3, strategy='tokens', seed=0)[source]

Bases: object

Vectorizes categories with a Word2Vec model

@deprecated

Parameters:
  • group_key (str) – name of the column to group
  • category_col (str) – name of the column containing the categorical variables
  • size (int, default=128) – dimension of the embedding vector
  • min_count (int, default=1) – minimum amount of instances to integrate it to the model
  • sg (int {0, 1}, default=1) – 1 for the skip-gram Word2Vec model (works well with small datasets), 0 for CBOW
  • window (int, default=3) – size of the context
  • strategy (str {‘tokens’, ‘strings’}, default=’tokens’) – if ‘tokens’, categories containing several words are split into individual tokens; if ‘strings’, each category is treated as a single word
fit(X, y=None)[source]

Fits the model by grouping the categories by group_key so that they can be embedded as text; a conceptual sketch follows the parameter list below

Parameters:
  • X (pd.DataFrame) – DataFrame containing the group_key and category_col columns
  • y
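
For illustration, a conceptual sketch of this grouping (the actual implementation may differ): categories sharing the same group_key form one "sentence", which is fed to a gensim Word2Vec model. The column names are hypothetical and the call uses the pre-4.0 gensim keyword size.

    import pandas as pd
    from gensim.models import Word2Vec

    df = pd.DataFrame({'patient_id': [1, 1, 2, 2, 2],
                       'diagnosis': ['flu', 'asthma', 'flu', 'diabetes', 'asthma']})

    # one "sentence" of categories per group_key value
    sentences = df.groupby('patient_id')['diagnosis'].apply(list).tolist()
    # [['flu', 'asthma'], ['flu', 'diabetes', 'asthma']]

    w2v = Word2Vec(sentences, size=128, min_count=1, sg=1, window=3, seed=0)
    vector = w2v.wv['flu']  # 128-dimensional embedding of the category 'flu'
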
fit_pretrained(path, **kwargs)[source]

Fits the model using pretrained word embeddings from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Parameters:path (str) – path to the wiki.lg.vec file
transform(X, y=None)[source]

Transforms the categories in X into their learned Word2Vec representation

Parameters:
  • X (pd.DataFrame) – DataFrame containing the group_key and category_col columns
  • y (None)
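
An end-to-end usage sketch with an illustrative DataFrame; the column names and the exact shape of the transformed output are assumptions, not guaranteed by the API:

    import pandas as pd
    from clintk.cat2vec.neural_embedding import W2VVectorizer

    df = pd.DataFrame({'patient_id': [1, 1, 2, 2, 2],
                       'diagnosis': ['flu', 'asthma', 'flu', 'diabetes', 'asthma']})

    vectorizer = W2VVectorizer(group_key='patient_id', category_col='diagnosis',
                               size=128, min_count=1, sg=1, window=3,
                               strategy='strings', seed=0)
    vectorizer.fit(df)
    # alternatively, load pretrained fastText vectors:
    # vectorizer.fit_pretrained('wiki.lg.vec')  # path is a placeholder
    embedded = vectorizer.transform(df)  # categories mapped to their learned vectors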