clintk.text2vec.w2v_clusters module
Clustering of word embeddings.
class clintk.text2vec.w2v_clusters.WordClustering(w2v_size=128, n_clusters=30, clustering=KMeans(n_clusters=30), pretrained=False, model_path=None)

Bases: sklearn.base.BaseEstimator
Theme-affinity vectorization of documents: each document is represented by its affinities to clusters of word embeddings. A usage sketch follows the parameter list below.

Parameters:
- w2v_size : int, default=128
  Size of the hidden layer of the Word2Vec embedding model.
- n_clusters : int, default=30
  Number of clusters, i.e. the dimension of the output vectors. It is advised to set n_clusters to the approximate number of lexical fields in the corpus.
- clustering : sklearn.cluster instance, default=KMeans(n_clusters=30)
  Clustering algorithm; its number of clusters must be equal to n_clusters.
- pretrained : bool, default=False
  If False, train a new Word2Vec model; if True, use an already trained model.
- model_path : str, default=None
  Path to the trained Word2Vec model. Only used when pretrained is set to True.
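A minimal usage sketch, assuming WordClustering follows the scikit-learn estimator convention; the toy corpus and parameter values are illustrative only:

    from sklearn.cluster import KMeans
    from clintk.text2vec.w2v_clusters import WordClustering

    # Toy corpus of tokenized documents (illustrative only)
    corpus = [['patient', 'fever', 'cough'],
              ['thorax', 'scan', 'nodule'],
              ['fever', 'antibiotic', 'treatment']]

    # n_clusters must match the clustering instance's number of clusters
    vectorizer = WordClustering(w2v_size=64, n_clusters=5,
                                clustering=KMeans(n_clusters=5))
    vectorizer.fit(corpus)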
fit(X=None, y=None, **fit_params)

Train the Word2Vec and clustering models.
Parameters:
- X (iterable of iterables, default=None) – corpus of tokenized documents when pretrained=False; otherwise leave X=None and the pretrained model is used
- y (None) – ignored, kept for scikit-learn API compatibility
- fit_params – additional parameters for the Word2Vec algorithm

Returns: self
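For the pretrained case, a hedged sketch; the path below is a placeholder, and the expected on-disk format of the model is not documented on this page:

    from clintk.text2vec.w2v_clusters import WordClustering

    # Reuse an already trained Word2Vec model instead of training one;
    # '/path/to/w2v.model' is a placeholder, not a real path.
    vectorizer = WordClustering(pretrained=True,
                                model_path='/path/to/w2v.model')
    vectorizer.fit()  # X stays None; the model at model_path is used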
clintk.text2vec.w2v_clusters.embed_corpus(X, n_clusters, clustering, **kwargs)

Transforms X into vectors of cluster affinities.
Deprecated: use the WordClustering object instead.

Parameters:
- X (iterable of iterables, length=n) – corpus of documents
- n_clusters (int) – number of clusters
- clustering (sklearn.cluster object) – instantiated clustering algorithm

Returns: np.ndarray, shape=(n, n_clusters)
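For intuition about what cluster-affinity vectorization computes, a self-contained sketch follows. It assumes the affinity vector of a document is the normalized histogram of its tokens' cluster assignments, which this page does not confirm, and it uses the gensim 4.x Word2Vec API; embed_corpus_sketch is a hypothetical stand-in, not the library function:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    def embed_corpus_sketch(X, n_clusters, clustering=None):
        # Train word embeddings on the tokenized corpus (gensim 4.x API)
        w2v = Word2Vec(sentences=X, vector_size=128, min_count=1)
        words = w2v.wv.index_to_key
        # Cluster the word vectors; every word receives a cluster label
        clustering = clustering or KMeans(n_clusters=n_clusters)
        labels = clustering.fit_predict(w2v.wv[words])
        word2cluster = dict(zip(words, labels))
        # Assumed affinity: per-document normalized histogram of labels
        out = np.zeros((len(X), n_clusters))
        for i, doc in enumerate(X):
            for token in doc:
                if token in word2cluster:
                    out[i, word2cluster[token]] += 1
            if out[i].sum() > 0:
                out[i] /= out[i].sum()
        return out  # shape (n, n_clusters)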