clintk.text2vec.tools module

clintk.text2vec.tools.avg_corpus(model, corpus)
Computes the average vector for each document of the corpus.
Parameters: - model (gensim.models.word2vec.Word2Vec instance) – trained word2vec model
- corpus (iterable of iterables) – the documents, each an iterable of tokens

clintk.text2vec.tools.avg_document(model, document)
Computes the average vector of the words in document, in the word2vec model space.
Parameters: - model (word2vec.KeyedVectors instance)
- document (list) – tokenized document to fold into a vector
Returns: avg – the average of the vectors of all the words in document
Return type: np.ndarray
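
The averaging described for avg_document and avg_corpus can be sketched as follows. This is a minimal illustration, not the clintk implementation: the names avg_document_sketch and avg_corpus_sketch, the dim argument, and the zero-vector fallback for documents with no known word are all assumptions. A plain dict stands in for the trained model; a gensim KeyedVectors object supports the same membership test and indexing.

```python
import numpy as np


def avg_document_sketch(vectors, document, dim):
    """Average the vectors of the tokens of `document` found in `vectors`.

    Hypothetical helper: `vectors` is any mapping from token to np.ndarray.
    """
    found = [vectors[token] for token in document if token in vectors]
    if not found:
        # Assumption: fall back to a zero vector when no token is known.
        return np.zeros(dim)
    return np.mean(found, axis=0)


def avg_corpus_sketch(vectors, corpus, dim):
    """Stack one averaged vector per document into a 2-D array."""
    return np.array([avg_document_sketch(vectors, doc, dim) for doc in corpus])


# Toy 2-dimensional "model" used only for illustration.
toy_vectors = {
    "fever": np.array([1.0, 0.0]),
    "cough": np.array([0.0, 1.0]),
}

# "unknown" is not in the toy model, so it is skipped.
doc_vec = avg_document_sketch(toy_vectors, ["fever", "cough", "unknown"], dim=2)
corpus_mat = avg_corpus_sketch(toy_vectors, [["fever"], ["fever", "cough"]], dim=2)
```

Skipping out-of-vocabulary tokens keeps the mean well defined; how the actual functions treat such tokens is not stated here.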

clintk.text2vec.tools.text_normalize(text, stop_words, stem=False)
Performs the preprocessing steps needed to optimize vectorization, such as normalization, stop word removal, and lemmatization.
Stemming for French is not accurate enough yet. @TODO: lemmatization for French; adapt the stemmer for other languages.
Parameters: - text (string) – text to normalize
- stop_words (list) – list of additional stop words to remove from the text
- stem (bool) – if True, stems the words to capture their core meaning. However, this does not perform well with French.
Returns: same text as input but cleansed and normalized
Return type: string
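
A rough sketch of the kind of preprocessing described above, using only the standard library. It is illustrative and makes several assumptions: the name text_normalize_sketch is hypothetical, only the stop words passed in are removed (no built-in list), and the stemmer is a naive placeholder that strips a trailing "s", not the stemmer clintk uses.

```python
import re


def text_normalize_sketch(text, stop_words, stem=False):
    """Lowercase, tokenize, drop stop words, and optionally stem (naively)."""
    # Lowercase and keep runs of word characters, which drops punctuation.
    tokens = re.findall(r"\w+", text.lower())

    # Remove the caller-supplied stop words (compared in lowercase).
    stop = {word.lower() for word in stop_words}
    tokens = [token for token in tokens if token not in stop]

    if stem:
        # Placeholder stemmer (assumption): strip a trailing "s" on longer words.
        tokens = [
            token[:-1] if token.endswith("s") and len(token) > 3 else token
            for token in tokens
        ]

    return " ".join(tokens)


sample = "The patient has a fever, and the Cough persists."
extra_stop_words = ["the", "a", "and", "has"]

plain = text_normalize_sketch(sample, extra_stop_words)
stemmed = text_normalize_sketch(sample, extra_stop_words, stem=True)
```

Returning a single space-joined string matches the documented return type; a caller would split it again before passing the tokens to avg_document.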