clintk.text2vec.w2v_clusters module¶
clustering of word embeddings
@TODO documentation of the module
class clintk.text2vec.w2v_clusters.WordClustering(w2v_size=128, n_clusters=30, clustering=KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=30, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0), pretrained=False, model_path=None)[source]¶
- Bases: sklearn.base.BaseEstimator
- Theme-affinity vectorization of documents.
- Parameters:
- w2v_size : int, default=128
- size of the hidden layer of the embedding Word2Vec model
- n_clusters : int, default=30
- number of clusters, i.e. the number of output features of the vectorization. It is advised to set n_clusters to the approximate number of lexical fields in the corpus.
- clustering : sklearn.cluster instance, default=KMeans(n_clusters=30)
- clustering algorithm; its number of clusters must be equal to n_clusters
- pretrained : bool, default=False
- False to train a new Word2Vec model, True to use an already trained model
- model_path : str, default=None
- path to the trained Word2Vec model; only used when pretrained is set to True
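The parameters above describe a "theme affinity" representation: words are embedded, the embeddings are clustered, and a document becomes a histogram over the clusters its words fall into. A minimal numpy sketch of that idea for a single document (the word vectors and centroids below are illustrative stand-ins, not the clintk API):

```python
import numpy as np

# Hypothetical stand-in for a trained Word2Vec model: word -> vector.
word_vectors = {
    "chemo":   np.array([1.0, 0.0]),
    "dose":    np.array([0.9, 0.1]),
    "scanner": np.array([0.0, 1.0]),
}

# Stand-in for fitted cluster centroids (n_clusters = 2).
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])

def doc_to_affinity(tokens):
    """Represent a tokenized document as the normalized histogram of its
    words' nearest clusters (the 'theme affinity' vector)."""
    counts = np.zeros(len(centroids))
    for tok in tokens:
        nearest = np.argmin(np.linalg.norm(centroids - word_vectors[tok], axis=1))
        counts[nearest] += 1
    return counts / counts.sum()

vec = doc_to_affinity(["chemo", "dose", "scanner"])
```

Here "chemo" and "dose" fall in the first cluster and "scanner" in the second, so the document vector is (2/3, 1/3).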
fit(X=None, y=None, **fit_params)[source]¶
- Train the Word2Vec and clustering models.
- Parameters:
- X : iterable of iterable, default=None
- corpus of tokenized documents if pretrained=False; otherwise X=None and the pretrained model is used
- y : None
- fit_params : additional parameters for the Word2Vec algorithm
- Returns:
- self
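Conceptually, fit first trains (or loads) word embeddings, then clusters the vocabulary vectors. A compact sketch of the clustering step using a hand-rolled Lloyd's k-means on stand-in embeddings (illustrative only, not the clintk implementation, which delegates to the `clustering` estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for trained Word2Vec output: 20 vocabulary vectors drawn
# around two artificial "lexical fields".
vocab_vecs = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 4)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 4)),
])

def kmeans_fit(X, n_clusters=2, n_iter=20):
    """Minimal Lloyd's k-means mirroring the clustering step of fit().
    Initialization is simplified (evenly spaced data points); a real
    KMeans would use k-means++."""
    centroids = X[np.linspace(0, len(X) - 1, n_clusters).astype(int)].copy()
    for _ in range(n_iter):
        # Assign each vocabulary vector to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans_fit(vocab_vecs)
```

After fitting, each cluster of word vectors approximates one lexical field; the fitted centroids are what the transform step measures affinity against.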
 
clintk.text2vec.w2v_clusters.embed_corpus(X, n_clusters, clustering, **kwargs)[source]¶
- Transforms X into vectors of cluster affinities.
- Deprecated: use the WordClustering object instead.
- Parameters:
- X : iterable of iterable, length=n
- corpus of documents
- clustering : sklearn.cluster object
- instantiated clustering algorithm
- Returns:
- Return type:
- np.ndarray, shape=(n, n_clusters)
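The returned array has one row of cluster affinities per document. A corpus-level sketch of that output shape, using hypothetical word vectors and centroids as stand-ins for the trained models:

```python
import numpy as np

# Hypothetical stand-ins for a trained Word2Vec model and fitted centroids.
word_vectors = {"mri":   np.array([0.0, 1.0]),
                "scan":  np.array([0.1, 0.9]),
                "chemo": np.array([1.0, 0.0])}
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])  # n_clusters = 2

def embed(corpus):
    """Return an array of shape (n, n_clusters): one row of normalized
    cluster affinities per tokenized document."""
    out = np.zeros((len(corpus), len(centroids)))
    for i, doc in enumerate(corpus):
        for tok in doc:
            nearest = np.argmin(np.linalg.norm(centroids - word_vectors[tok], axis=1))
            out[i, nearest] += 1
        out[i] /= out[i].sum()
    return out

affinities = embed([["mri", "scan"], ["chemo", "mri"]])
```

For this two-document corpus the result has shape (2, 2): the first document sits entirely in the second cluster, the second is split evenly between the two.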