clintk.text2vec.w2v_clusters module

clustering of word embeddings

@TODO documentation of the module

class clintk.text2vec.w2v_clusters.WordClustering(w2v_size=128, n_clusters=30, clustering=KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=30, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0), pretrained=False, model_path=None)[source]

Bases: sklearn.base.BaseEstimator

theme-affinity vectorization of documents

w2v_size : int, default=128
size of the hidden layer in the embedding Word2Vec model
n_clusters : int, default=30
number of clusters, which is also the number of output features of the vectorization. It is advised to set n_clusters to the approximate number of lexical fields
clustering : sklearn.cluster instance, default=KMeans(n_clusters=30)
clustering algorithm. Its number of clusters must be equal to n_clusters
pretrained : bool, default=False
False to train a new w2v model, True to use an already trained model
model_path : str, default=None
path to the trained w2v model. Only used when pretrained is set to True
fit(X=None, y=None, **fit_params)[source]

train w2v and clustering models

Parameters:X (iterable of iterable, default=None) – corpus of tokenized documents if pretrained=False; otherwise X=None and the pretrained model is used

y : None

fit_params : additional parameters for the word2vec algorithm

Returns:
Return type:self
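
The second stage of fit, clustering the word vectors, can be illustrated with a small sketch. This is not the clintk implementation: the default clustering is scikit-learn's KMeans and the vectors come from a Word2Vec model, whereas the code below uses a plain-Python k-means on hypothetical 2-D vectors. The names kmeans, toy_embeddings and word_to_cluster are assumptions made for illustration only.

```python
import math


def kmeans(vectors, init_centers, n_iter=10):
    """Plain Lloyd's k-means; returns one cluster label per input vector."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(v, centers[k]))
            for v in vectors
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical 2-D "embeddings"; a real model would produce w2v_size dimensions.
toy_embeddings = {
    "tumor": (0.9, 0.1),
    "cancer": (1.0, 0.0),
    "dose": (0.1, 0.9),
    "mg": (0.0, 1.0),
}
words = list(toy_embeddings)
labels = kmeans(
    [toy_embeddings[w] for w in words],
    init_centers=[(1.0, 0.0), (0.0, 1.0)],  # fixed init keeps the toy run deterministic
)
word_to_cluster = dict(zip(words, labels))
```

With these toy vectors the two medical-condition words end up in one cluster and the two dosage words in the other, which is the kind of lexical-field grouping n_clusters is meant to capture.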
get_clusters_words()[source]

return the words in each cluster

Returns:keys are cluster ids, values are lists of words
Return type:dict
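
As a rough illustration of the returned structure, the sketch below builds a dict of the same shape from a word-to-cluster assignment. The helper group_words_by_cluster and its input mapping are hypothetical and are not part of clintk; how the library stores its assignments internally is not documented here.

```python
from collections import defaultdict


def group_words_by_cluster(word_to_cluster):
    """Invert a {word: cluster_id} mapping into {cluster_id: [words]}."""
    clusters = defaultdict(list)
    for word, cluster_id in word_to_cluster.items():
        clusters[cluster_id].append(word)
    return dict(clusters)


# Hypothetical assignment, for illustration only.
example = group_words_by_cluster({"tumor": 0, "cancer": 0, "dose": 1})
```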
transform(X, y=None)[source]

transforms each row of X into a vector of cluster affinities

Parameters:
  • X (iterable of iterable)
  • y (None)
Returns:

transformed documents, where p=n_clusters

Return type:

numpy.ndarray, shape=(n, p)
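
The exact definition of "affinity" is not given in this reference. One plausible reading, shown here purely as an assumption, is the fraction of a document's known tokens that fall into each cluster. The function cluster_affinities and the word_to_cluster mapping are hypothetical, and the sketch returns nested lists rather than a numpy.ndarray to stay dependency-free.

```python
def cluster_affinities(docs, word_to_cluster, n_clusters):
    """Return an n x n_clusters table of per-document cluster fractions.

    Assumption: affinity = share of in-vocabulary tokens assigned to a cluster.
    Tokens missing from word_to_cluster are ignored.
    """
    rows = []
    for tokens in docs:
        counts = [0.0] * n_clusters
        known = 0
        for token in tokens:
            cluster_id = word_to_cluster.get(token)
            if cluster_id is not None:
                counts[cluster_id] += 1.0
                known += 1
        # Normalise by the number of known tokens; all-zero row if none are known.
        rows.append([c / known for c in counts] if known else counts)
    return rows


# Hypothetical cluster assignment and corpus, for illustration only.
word_to_cluster = {"tumor": 0, "cancer": 0, "dose": 1, "mg": 1}
docs = [["tumor", "dose", "cancer"], ["mg", "unknown"]]
affinities = cluster_affinities(docs, word_to_cluster, n_clusters=2)
```

Each row has one entry per cluster, which matches the documented output shape (n, p) with p=n_clusters.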

clintk.text2vec.w2v_clusters.embed_corpus(X, n_clusters, clustering, **kwargs)[source]

transforms X into vector of cluster affinities

Deprecated: use the WordClustering object instead.

Parameters:
  • X (iterable of iterable, length=n) – corpus of documents
  • clustering (sklearn.cluster object) – instantiated clustering algorithm
Returns:
Return type:np.ndarray, shape=(n, n_clusters)