clintk.cat2vec.neural_embedding module
Embedding high-cardinality categorical variables with distributed representations.
The first embedder relies on the Word2Vec algorithm [1] to learn vector representations of words in a corpus.
[1] Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.
The second one is based on transfer learning: we train a fully connected neural network on a predictive task (only binary classification is supported for now) so that the upper layers learn higher-level representations of the categories. After training, the category vectors can be extracted from the embedding space.
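A minimal sketch of this transfer-learning idea, written directly in Keras rather than with the classes documented below (the one-hot encoding, layer sizes and random data are illustrative assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_categories, hidden_dim = 100, 16

# X: one-hot encoding of the categorical variable, y: binary target
X = np.eye(n_categories)[np.random.randint(0, n_categories, size=1000)]
y = np.random.randint(0, 2, size=1000)

model = Sequential([
    Dense(hidden_dim, activation='relu', input_dim=n_categories),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X, y, epochs=20, verbose=0)

# each row of the first layer's kernel is the learned vector of one category
category_vectors = model.layers[0].get_weights()[0]  # shape (n_categories, hidden_dim)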
class clintk.cat2vec.neural_embedding.NeuralEmbedder(input_dim, layers, activation='relu', output='sigmoid', optimizer='adam', loss='binary-crossentropy', dropout=0.5, metrics=['acc', 'mae'], epochs=20)[source]

Bases: sklearn.base.BaseEstimator
Trains an MLP classifier to learn a distributed representation of the categories.
Only available for binary targets.
@TODO: the optimizer argument should be able to receive a keras.Optimizer instance. @TODO: add batch_size and a validation set?
Parameters:
- input_dim (tuple, (int, int)) – input_dim[0] is the number of units in the input layer; input_dim[1] is the dimension of the input (i.e. the number of features)
- layers (tuple) – the ith element is the number of neurons in the ith hidden layer, similar to sklearn’s MLP
- activation (str, default=’relu’) – activation function of the intermediate layers
- output (str, default=’sigmoid’) – output activation function; only ’sigmoid’ is supported, for binary classification
- optimizer (str, default=’adam’) – optimization algorithm used for backpropagation; see https://keras.io/optimizers for the available algorithms
- loss (str, default=’binary-crossentropy’) – loss minimized during training; see https://keras.io/losses
- dropout (float, default=0.5) – dropout rate
- metrics (list, default=[‘acc’, ‘mae’]) – metrics used during training and testing
- epochs (int, default=20) – number of training epochs
fit(X, y)[source]

Trains the model on the input data.

Parameters:
- X (iterable) – feature matrix
- y (iterable) – target vector (possibly one-hot encoded)

Returns: record of training loss and metric values at successive epochs, as well as validation loss and metric values (if applicable)

Return type: keras.callbacks.History.history
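A hypothetical usage sketch (the feature matrix, target and layer sizes below are invented for illustration; only the documented constructor arguments are used):

import numpy as np
from clintk.cat2vec.neural_embedding import NeuralEmbedder

# toy one-hot encoded features and binary target
X = np.random.randint(0, 2, size=(500, 64))
y = np.random.randint(0, 2, size=500)

embedder = NeuralEmbedder(input_dim=(32, 64),  # 32 units in the input layer, 64 input features
                          layers=(16, 8),      # two hidden layers
                          activation='relu',
                          output='sigmoid',
                          dropout=0.5,
                          epochs=20)
history = embedder.fit(X, y)  # record of loss and metric values per epoch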
class clintk.cat2vec.neural_embedding.W2VVectorizer(group_key, category_col, size=128, min_count=1, sg=1, window=3, strategy='tokens', seed=0)[source]

Bases: object
Vectorizes categories with a word2vec model.
@deprecated
Parameters:
- group_key (str) – name of the column to group by
- category_col (str) – name of the column containing the categorical variables
- size (int, default=128) – dimension of the embedding vectors
- min_count (int, default=1) – minimum number of occurrences of a category for it to be included in the model
- sg (int {0, 1}, default=1) – 1 for the skip-gram word2vec model (better suited for small datasets), 0 for CBOW
- window (int, default=3) – size of the context window
- strategy (str {‘tokens’, ‘strings’}, default=’tokens’) – if ‘tokens’, categories containing several words are split into individual tokens; otherwise each category is treated as a single word
- seed (int, default=0) – random seed
fit(X, y=None)[source]

Fits the model by grouping categories by group_key in order to embed the categories as text.

Parameters:
- X (pd.DataFrame)
- y
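A hypothetical usage sketch (the DataFrame and its column names are made up for illustration):

import pandas as pd
from clintk.cat2vec.neural_embedding import W2VVectorizer

df = pd.DataFrame({'patient_id': [1, 1, 2, 2, 3],
                   'category': ['diabetes', 'hypertension',
                                'asthma', 'diabetes', 'hypertension']})

vectorizer = W2VVectorizer(group_key='patient_id',
                           category_col='category',
                           size=128, min_count=1, sg=1,
                           window=3, strategy='tokens', seed=0)
vectorizer.fit(df)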
fit_pretrained(path, **kwargs)[source]

Fits the model using pretrained word embeddings from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Parameters: path (str) – path to the wiki.lg.vec file
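A hypothetical sketch of loading pretrained fastText vectors instead of training from scratch (the file path is a placeholder for a downloaded wiki.<lang>.vec file, and it is assumed that fit_pretrained can be called in place of fit):

# reuse the column names from the previous sketch; the path below is a placeholder
vectorizer = W2VVectorizer(group_key='patient_id', category_col='category')
vectorizer.fit_pretrained('/path/to/wiki.en.vec')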