clintk.text2vec.w2v_clusters module

clustering of word embeddings

@TODO documentation of the module

class clintk.text2vec.w2v_clusters.WordClustering(w2v_size=128, n_clusters=30, clustering=KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=30, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0), pretrained=False, model_path=None)

Bases: sklearn.base.BaseEstimator

theme-affinity vectorization of documents

w2v_size : int, default=128: size of the hidden layer of the Word2Vec embedding model
n_clusters : int, default=30: number of clusters, which is also the number of output features of the vectorization. It is advised to set n_clusters to the approximate number of lexical fields in the corpus
clustering : sklearn.cluster instance, default=KMeans(n_clusters=30): clustering algorithm. Its number of clusters must be equal to n_clusters
pretrained : bool, default=False: False to train a new w2v model, True to use an already trained model
model_path : str, default=None: path to the trained w2v model. Only used when pretrained is set to True

fit(X=None, y=None, **fit_params)

train w2v and clustering models

Parameters:	X (iterable of iterable, default=None) – corpus of tokenized documents if `pretrained`=False; otherwise X=None and the pretrained model is used

y : None

fit_params : additional parameters for the word2vec algorithm

Returns:
Return type:	self
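
The two training stages named above (word embeddings, then clustering) can be pictured with a small dependency-free sketch. It is not the clintk implementation: the 2-D vectors in `toy_vectors` are hand-written stand-ins for learned Word2Vec embeddings, and the hypothetical `kmeans` helper is a minimal Lloyd's iteration used in place of `sklearn.cluster.KMeans`.

```python
import math


def kmeans(points, centers, n_iter=10):
    """Minimal Lloyd's algorithm (illustrative stand-in for sklearn's KMeans)."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-D "embeddings"; a trained model would provide
# w2v_size-dimensional vectors instead.
toy_vectors = {
    "fever": (0.9, 0.1),
    "cough": (1.0, 0.0),
    "scan": (0.1, 0.9),
    "biopsy": (0.0, 1.0),
}
words = list(toy_vectors)
points = [toy_vectors[w] for w in words]

# Two clusters, seeded with one point from each visible group.
labels, _ = kmeans(points, centers=[points[0], points[2]])
word_to_cluster = dict(zip(words, labels))
print(word_to_cluster)
```

Each word ends up with one cluster id, which is the information the later vectorization step relies on.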

get_clusters_words()

return the words in each cluster

Returns:	keys are cluster ids, values are lists of words
Return type:	dict
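
As an illustration of the returned structure, the sketch below builds a dict of the same shape from a list of words and their cluster labels. The helper name `group_words_by_cluster` and the sample words are hypothetical; how clintk builds the dict internally is not documented here.

```python
from collections import defaultdict


def group_words_by_cluster(words, labels):
    """Map each cluster id to the list of words assigned to it."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)
    return dict(clusters)


clusters = group_words_by_cluster(
    ["fever", "cough", "scan", "biopsy"],  # vocabulary (made up)
    [0, 0, 1, 1],                          # cluster label of each word
)
print(clusters)  # {0: ['fever', 'cough'], 1: ['scan', 'biopsy']}
```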

transform(X, y=None)

transforms each row of X into a vector of cluster affinities

Parameters:	X (iterable of iterable)

y (None)
Returns:	transformed documents, where p=n_clusters
Return type:	numpy.ndarray, shape=(n, p)
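
The exact definition of "affinity" is not given in this reference. The sketch below assumes one plausible reading: the affinity of a document for a cluster is the share of its known tokens that belong to that cluster. The function `cluster_affinities` and the `word_to_cluster` mapping are hypothetical, and the result is a list of lists rather than a `numpy.ndarray` to keep the example dependency-free.

```python
def cluster_affinities(docs, word_to_cluster, n_clusters):
    """Return one row per document, with one affinity per cluster.

    Assumption: affinity = fraction of the document's known tokens that
    fall in the cluster. Tokens missing from the mapping are ignored.
    """
    rows = []
    for tokens in docs:
        counts = [0.0] * n_clusters
        known = 0
        for token in tokens:
            cluster = word_to_cluster.get(token)
            if cluster is not None:
                counts[cluster] += 1
                known += 1
        # A document with no known token keeps an all-zero row.
        rows.append([c / known for c in counts] if known else counts)
    return rows


word_to_cluster = {"fever": 0, "cough": 0, "scan": 1, "biopsy": 1}
docs = [["fever", "cough", "scan"], ["biopsy", "unknown"]]
affinities = cluster_affinities(docs, word_to_cluster, n_clusters=2)
print(affinities)
```

The output has one row per document and one column per cluster, matching the documented shape (n, p) with p=n_clusters.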

clintk.text2vec.w2v_clusters.embed_corpus(X, n_clusters, clustering, **kwargs)

transforms X into vectors of cluster affinities

Deprecated: use the WordClustering object instead.

Parameters:	X (iterable of iterable, length=n) – corpus of documents

clustering (sklearn.cluster object) – instantiated clustering algorithm

Returns:
Return type:	np.ndarray, shape=(n, n_clusters)