clintk.text2vec.transformers module

Transformer classes for sklearn pipeline compatibility

class clintk.text2vec.transformers.AverageWords2Vector(n_components=128)[source]

Bases: sklearn.base.BaseEstimator

Trains an unsupervised word2vec model, then projects text data into its embedding space. This class exists only as a convenience for using word2vec in a pipeline.

Parameters:n_components (int, default=128) – dimension of the embedding vector
fit(parsed_reports, y=None, **kwargs)[source]

Trains the word2vec model with given corpus as input

Parameters:
  • parsed_reports (iterable of iterables) – contains parsed tokenized reports
  • y (None)
  • **kwargs – additional arguments to pass to gensim.Word2Vec (see appropriate documentation for details)
fit_pretrained(path, **kwargs)[source]

Fits the transformer with a pretrained model from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Parameters:path (str) – path to the model
transform(parsed_reports)[source]

Turns each document into a vector by averaging the vectors of all its words

Parameters:parsed_reports (iterable of iterables)
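
The averaging step described for transform can be illustrated with a minimal pure-Python sketch. This is not the clintk implementation: the helper name average_word_vectors, the toy embeddings, and the zero-vector fallback for reports with no known token are all assumptions made for illustration.

```python
def average_word_vectors(parsed_reports, embeddings, n_components):
    """Average the embedding vectors of the known tokens in each report.

    Hypothetical helper: `embeddings` is a plain dict mapping a token to a
    list of floats, standing in for a trained word2vec vocabulary.
    """
    vectors = []
    for tokens in parsed_reports:
        known = [embeddings[t] for t in tokens if t in embeddings]
        if not known:
            # Assumption: a report with no known token maps to the zero vector.
            vectors.append([0.0] * n_components)
            continue
        # Component-wise mean over the word vectors of the report.
        vectors.append([sum(column) / len(known) for column in zip(*known)])
    return vectors


# Toy 2-dimensional embeddings, invented for this example.
toy_embeddings = {
    "tumor": [1.0, 0.0],
    "stable": [0.0, 1.0],
    "growth": [1.0, 1.0],
}
reports = [["tumor", "stable"], ["growth"], ["unseen"]]
doc_vectors = average_word_vectors(reports, toy_embeddings, n_components=2)
```

Each report yields one vector of length n_components, regardless of how many tokens it contains, which is what makes the output usable as a fixed-size feature matrix in a pipeline.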
class clintk.text2vec.transformers.Text2Vector(n_components=128, dm=1, window=3)[source]

Bases: sklearn.base.BaseEstimator

Implementation of the Doc2Vec model adapted to sklearn for hyperparameter tuning

fit(reports, y=None, **kwargs)[source]

Tags reports (for consistency with gensim's model) and trains a Doc2Vec model on the corpus

Parameters:
  • reports (iterable of iterables) – list of tokenized reports
  • y (not used, default=None)
  • **kwargs – additional arguments to pass to gensim.Word2Vec (see appropriate documentation for details)
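
The tagging step mentioned for fit could look roughly like the sketch below. It uses a namedtuple stand-in with the same two fields (words, tags) as gensim's TaggedDocument so that it runs without gensim installed. The helper name tag_reports and the choice of the report index as the tag are assumptions; the tagging scheme clintk actually uses is not documented here.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which exposes the same two fields.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_reports(reports):
    """Attach a tag to every tokenized report (hypothetical helper).

    Assumption: each report is tagged with its position in the corpus.
    """
    return [
        TaggedDocument(words=list(tokens), tags=[index])
        for index, tokens in enumerate(reports)
    ]


# Toy tokenized reports, invented for this example.
reports = [["no", "new", "lesion"], ["lesion", "size", "increased"]]
tagged = tag_reports(reports)
```

With gensim available, a list like tagged is the kind of corpus a Doc2Vec model is trained on.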
transform(reports)[source]

Transforms reports into the embedding space using the previously trained Doc2Vec model

Parameters:reports (iterable of iterables) – list of tokenized reports
Returns:vectorized reports
Return type:np.ndarray