clintk.text2vec.transformers module
object classes for sklearn pipeline compatibility
class clintk.text2vec.transformers.AverageWords2Vector(n_components=128)
Bases: sklearn.base.BaseEstimator
Trains an unsupervised word2vec model, then folds text data according to it. This class exists only as a convenience for using word2vec in a pipeline.
Parameters: n_components (int, default=128) – dimension of the embedding vector
fit(parsed_reports, y=None, **kwargs)
Trains the word2vec model with the given corpus as input.
Parameters: - parsed_reports (iterable of iterables) – parsed, tokenized reports
- y (None)
- **kwargs – additional arguments to pass to gensim.Word2Vec (see the gensim documentation for details)
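To illustrate the input format expected by fit and the idea the class name suggests, the sketch below averages per-token vectors into one fixed-length vector per report. It is a minimal pure-Python sketch under stated assumptions, not clintk's implementation: the `average_vector` helper and the toy embedding table are hypothetical, and skipping out-of-vocabulary tokens is an assumption. In practice the vectors would come from the trained gensim model.

```python
from typing import Dict, Iterable, List


def average_vector(tokens: Iterable[str],
                   embeddings: Dict[str, List[float]],
                   n_components: int) -> List[float]:
    """Average the vectors of known tokens; return zeros if none is known."""
    total = [0.0] * n_components
    count = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary tokens are skipped
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


# parsed_reports: an iterable of iterables, one token list per report
parsed_reports = [["no", "evidence", "of", "tumor"], ["tumor", "progression"]]

# toy 2-dimensional embedding table (hypothetical values)
toy_embeddings = {
    "tumor": [1.0, 0.0],
    "progression": [0.0, 1.0],
    "evidence": [3.0, 2.0],
}

report_vectors = [average_vector(r, toy_embeddings, 2) for r in parsed_reports]
```

Each report, whatever its length, ends up as a vector of `n_components` values, which is what makes the result usable as a feature matrix in a sklearn pipeline.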
fit_pretrained(path, **kwargs)
Fits the transformer with a pretrained model from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Parameters: path (str) – path to the model
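The pretrained fastText vectors linked above are distributed, among other formats, as text `.vec` files: a header line giving the vocabulary size and the dimension, then one line per word followed by its vector components. Whether fit_pretrained expects this text format or the binary one is not stated here. The sketch below only shows how the text format can be read. The `read_vec` helper and the sample data are hypothetical.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_vec(handle: TextIO) -> Tuple[int, Dict[str, List[float]]]:
    """Read a word-vector text file: header 'n_words dim', then 'word v1 ... vdim'."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# made-up sample standing in for a real .vec file
sample = io.StringIO("2 3\nscan 0.1 0.2 0.3\nlesion 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
```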
class clintk.text2vec.transformers.Text2Vector(n_components=128, dm=1, window=3)
Bases: sklearn.base.BaseEstimator
Implementation of the Doc2Vec model adapted to sklearn for hyperparameter tuning.
fit(reports, y=None, **kwargs)
Tags reports (for consistency with gensim's model) and trains the Doc2Vec model on the corpus.
Parameters: - reports (iterable of iterables) – list of tokenized reports
- y (not used, default=None)
- **kwargs – additional arguments to pass to gensim.Word2Vec (see the gensim documentation for details)
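gensim's Doc2Vec is trained on tagged documents, each pairing a token list with one or more tags. The tagging scheme used by fit is not described here. The sketch below assumes each report is tagged with its position in the corpus. It uses a stand-in namedtuple, `TaggedReport` (a hypothetical name), with the same `words` and `tags` fields as gensim's `TaggedDocument`, so it runs without gensim installed.

```python
from collections import namedtuple
from typing import Iterable, List

# stand-in for gensim.models.doc2vec.TaggedDocument (same field names)
TaggedReport = namedtuple("TaggedReport", ["words", "tags"])


def tag_reports(reports: Iterable[Iterable[str]]) -> List[TaggedReport]:
    """Pair each tokenized report with a tag; here the tag is its corpus index (assumption)."""
    return [
        TaggedReport(words=list(tokens), tags=[index])
        for index, tokens in enumerate(reports)
    ]


reports = [["no", "evidence", "of", "tumor"], ["tumor", "progression"]]
tagged = tag_reports(reports)
```

With gensim available, the stand-in could be swapped for `TaggedDocument` and the resulting list used as the training corpus for `Doc2Vec`.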