clintk.text2vec.tools module

clintk.text2vec.tools.avg_corpus(model, corpus)[source]

Computes the average word vector of each document in the corpus.

Parameters:
  • model (gensim.models.word2vec.Word2Vec instance) – Trained word2vec model
  • corpus (iterable of iterables) – tokenized documents, each one an iterable of words
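
The shape of the corpus argument can be illustrated with a dependency-free sketch. It is not the clintk implementation: `toy_vectors` is a hypothetical dict standing in for the trained model, `sketch_avg_corpus` is an illustrative name, plain lists replace NumPy arrays, and skipping unknown words is an assumption.

```python
from statistics import fmean

# Hypothetical stand-in for a trained model: a plain word -> vector lookup.
toy_vectors = {
    "fever": [1.0, 0.0],
    "cough": [0.0, 1.0],
    "pain": [1.0, 1.0],
}


def sketch_avg_corpus(vectors, corpus):
    """Return one averaged vector per document, in corpus order."""
    averaged = []
    for document in corpus:
        # Assumption: words missing from the lookup are ignored.
        known = [vectors[word] for word in document if word in vectors]
        # zip(*known) groups the values dimension by dimension.
        averaged.append([fmean(dim) for dim in zip(*known)] if known else [])
    return averaged


# The corpus is an iterable of tokenized documents.
corpus = [["fever", "cough"], ["pain"], ["fever", "pain", "unknown"]]
corpus_vectors = sketch_avg_corpus(toy_vectors, corpus)
```

Each input document yields exactly one output vector, so the result has as many entries as the corpus has documents.
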
clintk.text2vec.tools.avg_document(model, document)[source]

Computes the average vector of the words in document, in the word2vec model space.

Parameters:
  • model (word2vec.KeyedVectors instance)
  • document (list) – tokenized document to fold into a vector
Returns:

avg – the average of the vectors of all the words in document

Return type:

np.ndarray
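
The averaging itself is component-wise arithmetic, sketched below under stated assumptions: `toy_vectors` is a hypothetical lookup rather than a real KeyedVectors object, `sketch_avg_document` is an illustrative name, and both the handling of unknown words (skipped) and of empty documents (returns None) are guesses, since the documentation does not specify them. The documented function returns an np.ndarray; this sketch returns a plain list to stay dependency-free.

```python
def sketch_avg_document(vectors, document):
    """Average, dimension by dimension, the vectors of the known tokens."""
    # Assumption: tokens absent from the lookup are skipped.
    known = [vectors[token] for token in document if token in vectors]
    if not known:
        return None  # assumption: no vector when no token is known
    size = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


# Hypothetical 3-dimensional vectors for two words.
toy_vectors = {"chest": [2.0, 0.0, 4.0], "pain": [0.0, 2.0, 2.0]}

# "xyz" is out of vocabulary, so only "chest" and "pain" are averaged:
# [(2+0)/2, (0+2)/2, (4+2)/2]
avg = sketch_avg_document(toy_vectors, ["chest", "pain", "xyz"])
```
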

clintk.text2vec.tools.text_normalize(text, stop_words, stem=False)[source]

This function performs the preprocessing steps needed to optimize the vectorization, such as normalization, stop word removal, lemmatization, etc.

Stemming for French is not accurate enough yet. @TODO: add lemmatization for French and adapt the stemmer for other languages.

Parameters:
  • text (string) – text to normalize
  • stop_words (list) – list of additional stop words to remove from the text
  • stem (bool) – if True, stems the words to capture their meaning. However, this functionality does not perform well with French
Returns:

The input text, cleaned and normalized

Return type:

string
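
A minimal sketch of the kind of normalization described above, using only the standard library. The exact steps of the clintk function are not documented here, so the choices below (lowercasing, replacing punctuation with spaces, dropping the supplied stop words) are assumptions, `sketch_text_normalize` is an illustrative name, and the optional stemming step is left out.

```python
import re


def sketch_text_normalize(text, stop_words):
    """Lowercase, strip punctuation and remove the given stop words."""
    text = text.lower()
    # Replace every character that is neither a word character nor
    # whitespace with a space; accented letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    stop = {word.lower() for word in stop_words}
    tokens = [token for token in text.split() if token not in stop]
    # The result is a single string, matching the documented return type.
    return " ".join(tokens)


cleaned = sketch_text_normalize(
    "Le patient présente une FIÈVRE, et une toux.",
    ["le", "une", "et"],
)
```
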