
sklearn.decomposition.LatentDirichletAllocation

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Latent Dirichlet Allocation with the online variational Bayes algorithm.

New in version 0.17.

Read more in the User Guide.

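Parameter	Description
n_components	int, optional (default=10) Number of topics. Changed in version 0.19: `n_topics` was renamed to `n_components`.
doc_topic_prior	float, optional (default=None) Prior of the document topic distribution `theta`. If the value is None, it defaults to `1 / n_components`. In [1] (Hoffman et al., 2010), this is called `alpha`.
topic_word_prior	float, optional (default=None) Prior of the topic word distribution `beta`. If the value is None, it defaults to `1 / n_components`. In [1] (Hoffman et al., 2010), this is called `eta`.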
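learning_method	'batch' | 'online', default='batch' Method used to update `components_`. Only used in the `fit` method. In general, if the data size is large, the online update will be much faster than the batch update. Valid options: 'batch': batch variational Bayes method; it uses all training data in each EM update, and the old `components_` is overwritten in each iteration. 'online': online variational Bayes method; in each EM update, it uses a mini-batch of training data to update the `components_` variable incrementally. The learning rate is controlled by the `learning_decay` and `learning_offset` parameters. Changed in version 0.20: the default learning method is now 'batch'.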
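learning_decay	float, optional (default=0.7) A parameter that controls the learning rate in the online learning method. The value should be set between (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and `batch_size` is `n_samples`, the update method is the same as batch learning. In the literature, this is called kappa.
learning_offset	float, optional (default=10.) A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.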
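max_iter	integer, optional (default=10) The maximum number of iterations.
batch_size	int, optional (default=128) Number of documents to use in each EM iteration. Only used in online learning.
evaluate_every	int, optional (default=-1) How often to evaluate perplexity. Only used in the `fit` method. Set it to 0 or a negative number to skip perplexity evaluation during training entirely. Evaluating perplexity can help check convergence during training, but it also increases total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.
total_samples	int, optional (default=1e6) Total number of documents. Only used in the `partial_fit` method.
perp_tol	float, optional (default=1e-1) Perplexity tolerance in batch learning. Only used when `evaluate_every` is greater than 0.
mean_change_tol	float, optional (default=1e-3) Stopping tolerance for updating the document topic distribution in the E-step.
max_doc_update_iter	int (default=100) Maximum number of iterations for updating the document topic distribution in the E-step.
n_jobs	int or None, optional (default=None) The number of jobs to use in the E-step. None means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors. See the Glossary for more details.
verbose	int, optional (default=0) Verbosity level.
random_state	int, RandomState instance, default=None Pass an int for reproducible results across multiple function calls. See the Glossary.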
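
As a rough illustration of how `learning_decay` (kappa) and `learning_offset` (tau_0) interact, the sketch below evaluates the step-size schedule rho_t = (tau_0 + t)^(-kappa) described by Hoffman et al. (2010). The helper name `online_learning_rate` is hypothetical and not part of scikit-learn; it only mirrors the formula.

```python
import numpy as np


def online_learning_rate(t, learning_offset=10.0, learning_decay=0.7):
    """Hypothetical helper: step size rho_t = (tau_0 + t) ** (-kappa)."""
    return np.power(learning_offset + t, -learning_decay)


# Step sizes for the first few mini-batch updates with the default settings.
steps = np.arange(1, 6)
default_rates = online_learning_rate(steps)

# A larger offset downweights the early iterations more strongly.
damped_rates = online_learning_rate(steps, learning_offset=100.0)

# With learning_decay=0.0 every update gets weight 1.0.
flat_rates = online_learning_rate(steps, learning_decay=0.0)

for t, r_default, r_damped in zip(steps, default_rates, damped_rates):
    print(f"t={t}: default={r_default:.4f}, offset=100 -> {r_damped:.4f}")
```

The rates shrink as more mini-batches are seen, so later batches change `components_` less than early ones.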

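Attribute	Description
components_	array, [n_components, n_features] Variational parameters for the topic word distribution. Since the complete conditional for the topic word distribution is a Dirichlet, `components_[i, j]` can be viewed as a pseudocount that represents the number of times word `j` was assigned to topic `i`. It can also be viewed as the distribution over the words for each topic after normalization: `model.components_ / model.components_.sum(axis=1)[:, np.newaxis]`.
n_batch_iter_	int Number of iterations of the EM step.
n_iter_	int Number of passes over the dataset.
bound_	float Final perplexity score on the training set.
doc_topic_prior_	float Prior of the document topic distribution `theta`. If the value is None, it is `1 / n_components`.
topic_word_prior_	float Prior of the topic word distribution `beta`. If the value is None, it is `1 / n_components`.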
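
The normalization of `components_` mentioned above can be sketched as follows. This is a minimal example on synthetic count data; the variable names `topic_word` and `top_words` are illustrative, and the "words" here are just column indices rather than real vocabulary entries.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for a document-word matrix.
X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Normalize the pseudocounts so each row is a distribution over words.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Indices of the three highest-probability words for each topic.
top_words = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]
for topic_idx, word_ids in enumerate(top_words):
    print(f"topic {topic_idx}: word indices {word_ids.tolist()}")
```

With real text, the column indices would be mapped back to terms through the vectorizer that produced the document-word matrix.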

References:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach, 2010

[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013

[3] Matthew D. Hoffman's onlineldavb code. Link: https://github.com/blei-lab/onlineldavb

Examples:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])

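Methods:

Method	Description
`fit`(self, X[, y])	Learn a model for the data X with the variational Bayes method.
`fit_transform`(self, X[, y])	Fit to data, then transform it.
`get_params`(self[, deep])	Get parameters for this estimator.
`partial_fit`(self, X[, y])	Online VB with mini-batch update.
`perplexity`(self, X[, sub_sampling])	Calculate the approximate perplexity for data X.
`score`(self, X[, y])	Calculate the approximate log-likelihood as the score.
`set_params`(self, **params)	Set the parameters of this estimator.
`transform`(self, X)	Transform data X according to the fitted model.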

__init__(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Initialize self. See help(type(self)) for the accurate signature.

fit(self, X, y=None)

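Learn a model for the data X with the variational Bayes method.

When `learning_method` is 'online', mini-batch updates are used. Otherwise, batch updates are used.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
self	The fitted estimator.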
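
The two update modes of `fit` can be compared with a short sketch like the one below. It assumes a small synthetic count matrix; the settings (`batch_size=64`, 200 samples) are arbitrary choices for illustration, and the resulting scores will differ between the two modes.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count data in place of a real document-word matrix.
X, _ = make_multilabel_classification(
    n_samples=200, n_features=30, random_state=0
)

# Batch variational Bayes: every EM update uses all 200 documents.
batch_lda = LatentDirichletAllocation(
    n_components=5, learning_method="batch", max_iter=10, random_state=0
).fit(X)

# Online variational Bayes: every EM update uses a mini-batch of 64 documents.
online_lda = LatentDirichletAllocation(
    n_components=5,
    learning_method="online",
    batch_size=64,
    max_iter=10,
    random_state=0,
).fit(X)

# Approximate log-likelihood bound of each fitted model on the training data.
print("batch score: ", batch_lda.score(X))
print("online score:", online_lda.score(X))
```

On data this small the speed difference is negligible; the online mode is mainly of interest when the corpus is large.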

fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits the transformer to X and y with the optional parameters fit_params and returns a transformed version of X.

Parameter	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
y	ndarray of shape (n_samples,), default=None Target values.
**fit_params	dict Additional fit parameters.

Returns	Description
X_new	ndarray of shape (n_samples, n_features_new) Transformed array.

get_params(self, deep=True)

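Get parameters for this estimator.

Parameter	Description
deep	bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns	Description
params	mapping of string to any Parameter names mapped to their values.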

partial_fit(self, X, y=None)

Online VB with mini-batch update.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
self	The fitted estimator.
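
A minimal sketch of incremental training with `partial_fit` is shown below. It assumes the corpus arrives in chunks of 100 documents (an arbitrary size) and that the full corpus size is known in advance so it can be passed as `total_samples`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count data standing in for a stream of documents.
X, _ = make_multilabel_classification(n_samples=300, random_state=0)

# total_samples tells the online update how large the full corpus is.
lda = LatentDirichletAllocation(
    n_components=5, total_samples=X.shape[0], random_state=0
)

# Feed the data one chunk at a time, as if it did not fit in memory.
chunk_size = 100
for start in range(0, X.shape[0], chunk_size):
    lda.partial_fit(X[start:start + chunk_size])

# The incrementally trained model can be used like any fitted model.
doc_topic = lda.transform(X[:3])
print(np.round(doc_topic, 3))
```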

perplexity(self, X, sub_sampling=False)

Calculate the approximate perplexity for data X.

Perplexity is defined as exp(-1. * log-likelihood per word).

Changed in version 0.19: the doc_topic_distr argument has been deprecated and is ignored because the user no longer has access to the unnormalized distribution.

Parameter	Description
X	array-like or sparse matrix, [n_samples, n_features] Document word matrix.
sub_sampling	bool Whether to do sub-sampling.

Returns	Description
score	float Perplexity score.
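
The definition above can be checked numerically. The sketch below assumes `sub_sampling=False` and takes the word count to be the sum of all entries of X; under those assumptions the perplexity should equal the exponential of the negative per-word value returned by `score`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Approximate log-likelihood bound for the whole matrix.
log_likelihood = lda.score(X)

# Perplexity as reported by the estimator.
reported = lda.perplexity(X)

# Perplexity recomputed by hand: exp(-1. * log-likelihood per word).
word_count = X.sum()
manual = np.exp(-1.0 * log_likelihood / word_count)

print("reported perplexity:", reported)
print("manual perplexity:  ", manual)
```

Lower perplexity indicates a better fit to the data under this approximation.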

score(self, X, y=None)

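Calculate the approximate log-likelihood as the score.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
score	float The approximate bound, used as the score.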

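set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Parameter	Description
**params	dict Estimator parameters.

Returns	Description
self	object Estimator instance.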
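
The nested parameter naming can be sketched with a small pipeline. The step names `vect` and `lda` and the toy documents are arbitrary choices for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# A two-step pipeline: token counts followed by LDA.
pipe = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("lda", LatentDirichletAllocation(random_state=0)),
    ]
)

# Nested parameters are addressed as <component>__<parameter>.
pipe.set_params(lda__n_components=3, vect__stop_words="english")

docs = [
    "apples and bananas are fruit",
    "bananas and oranges are fruit",
    "cars and trucks are vehicles",
    "trucks and buses are vehicles",
]
doc_topic = pipe.fit_transform(docs)
print(doc_topic.shape)
```

Setting parameters on the estimator directly works the same way without the prefix, for example `lda.set_params(n_components=3)`.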

transform(self, X)

Transform data X according to the fitted model.

Changed in version 0.18: doc_topic_distr is now normalized

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.

Returns	Description
doc_topic_distr	shape=(n_samples, n_components) Document topic distribution for X.
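
Because the returned distribution is normalized, each row can be read as topic proportions for one document. A minimal sketch on synthetic data, where `dominant_topic` is an illustrative variable name:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# One row per document, one column per topic.
doc_topic = lda.transform(X)

# Each row sums to 1, so argmax picks the most prominent topic per document.
dominant_topic = doc_topic.argmax(axis=1)

print("row sums:", np.round(doc_topic[:3].sum(axis=1), 6))
print("dominant topics of the first ten documents:", dominant_topic[:10])
```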

Examples using sklearn.decomposition.LatentDirichletAllocation

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
