clintk.text2vec.tools module

clintk.text2vec.tools.avg_corpus(model, corpus)

Computes the average vector of each document in the corpus.

Parameters:	model (gensim.word2vec.Word2Vec instance) – trained word2vec model
	corpus (iterable of iterables) – the documents to average, each given as an iterable of tokens
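The idea of averaging a whole corpus can be sketched as follows. This is a minimal illustration, not the clintk implementation: the function name `average_corpus`, the `dim` argument, and the plain dict standing in for the trained model are hypothetical. Skipping out-of-vocabulary tokens, returning a zero vector for a document with no known word, and stacking the results into a 2-D array are all assumptions, since the return value is not documented here.

```python
import numpy as np


def average_corpus(vectors, corpus, dim):
    """Return one averaged vector per document, stacked row by row.

    `vectors` is a hypothetical word -> vector mapping standing in for
    a trained word2vec model; `dim` is the vector size.
    """
    rows = []
    for document in corpus:
        # Assumption: tokens missing from the vocabulary are ignored.
        found = [vectors[token] for token in document if token in vectors]
        # Assumption: a document with no known token maps to zeros.
        rows.append(np.mean(found, axis=0) if found else np.zeros(dim))
    return np.vstack(rows) if rows else np.empty((0, dim))


toy_vectors = {
    "patient": np.array([1.0, 0.0]),
    "fever": np.array([0.0, 1.0]),
}
toy_corpus = [["patient", "fever"], ["fever"], []]
matrix = average_corpus(toy_vectors, toy_corpus, dim=2)
print(matrix.shape)
```

Each row of `matrix` corresponds to one document of `toy_corpus`, in order.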

clintk.text2vec.tools.avg_document(model, document)

Computes the average vector of the words in document, in the word2vec model space.

Parameters:	model (word2vec.KeyedVectors instance)
	document (list) – tokenized document to fold into a vector
Returns:	avg – the average of the vectors of all the words in document
Return type:	np.ndarray
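As a rough illustration of folding a tokenized document into a single vector, the sketch below averages word vectors with NumPy. It is not the clintk source: `average_document`, its `dim` argument, and the dict used in place of a `KeyedVectors` instance are hypothetical, and the handling of unknown words (skipped) and of empty documents (zero vector) is an assumption.

```python
import numpy as np


def average_document(vectors, document, dim):
    """Average the vectors of the tokens of `document`.

    `vectors` is a hypothetical word -> vector mapping standing in for
    a KeyedVectors instance; `dim` is the vector size.
    """
    # Assumption: tokens missing from the vocabulary are ignored.
    found = [vectors[token] for token in document if token in vectors]
    if not found:
        # Assumption: nothing to average, so return a zero vector.
        return np.zeros(dim)
    return np.mean(found, axis=0)


toy_vectors = {
    "patient": np.array([1.0, 0.0]),
    "fever": np.array([0.0, 1.0]),
}
avg = average_document(toy_vectors, ["patient", "fever", "unknown"], dim=2)
print(avg)  # [0.5 0.5]
```

The result has the same dimensionality as the individual word vectors, which is what allows documents of different lengths to be compared in the same space.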

clintk.text2vec.tools.text_normalize(text, stop_words, stem=False)

This function performs the preprocessing steps needed to optimize the vectorization, such as normalization, stop-word removal, lemmatization, etc.

Stemming for French is not accurate enough yet. @TODO: lemmatization for French; adapt the stemmer for other languages.

Parameters:	text (string) – text to normalize
	stop_words (list) – list of additional stop words to remove from the text
	stem (bool) – if True, stems the words to capture their meaning. However, this option does not perform well with French
Returns:	the input text, cleaned and normalized
Return type:	string
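The kind of preprocessing described above can be approximated with the standard library alone. The sketch below is only an assumption about what such a normalization might look like (lowercasing, keeping alphabetic tokens, dropping stop words); `normalize_text` is a hypothetical name, and the stemming and lemmatization steps are deliberately left out.

```python
import re


def normalize_text(text, stop_words):
    """Lowercase `text`, keep alphabetic tokens, and drop stop words.

    Hypothetical helper for illustration; it does not stem or lemmatize.
    """
    # Keep runs of letters only (accented letters included),
    # discarding digits, underscores, and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    stop = {word.lower() for word in stop_words}
    return " ".join(token for token in tokens if token not in stop)


cleaned = normalize_text(
    "Le patient a de la FIÈVRE, depuis 3 jours !",
    ["le", "a", "de", "la"],
)
print(cleaned)  # patient fièvre depuis jours
```

The output is again a single string, matching the documented return type, so it can be split into tokens before being passed to a vectorization step.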