Solving your problem

You should use the model's score() method, which returns the (approximate) log-likelihood of the documents passed to it.

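Assuming you have built the documents as described in the paper and trained one LDA model per host, take the lowest log-likelihood over all training documents and use it as the threshold. Any unknown document that scores below that threshold is flagged as an anomaly. Example untested code follows: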

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation(... parameters here ...)
lda.fit(X)
threshold = min([lda.score([x]) for x in X])
attacks = [
    i for i, x in enumerate(X_unknown)
    if lda.score([x]) < threshold
]
# attacks now contains the indexes of the anomalies
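
To make the thresholding idea above concrete, here is a minimal self-contained sketch. The random Poisson count matrices, the vocabulary size, n_components=5, and the helper doc_scores() are all assumptions made for illustration; they are not taken from the paper or from the question. Rows are sliced as docs[i:i + 1] so that each document stays two-dimensional, which should also work when the matrix is sparse.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Hypothetical document-term counts: 40 training docs, 30-word vocabulary
X = rng.poisson(1.0, size=(40, 30))
# Hypothetical unknown documents to be checked
X_unknown = rng.poisson(1.0, size=(10, 30))

# n_components=5 is an arbitrary illustrative choice
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)


def doc_scores(model, docs):
    """Score every row on its own (hypothetical helper).

    Slicing with i:i + 1 keeps each document as a 2-D, one-row matrix.
    """
    return np.array(
        [model.score(docs[i:i + 1]) for i in range(docs.shape[0])]
    )


train_scores = doc_scores(lda, X)
threshold = train_scores.min()  # lowest score seen during training

unknown_scores = doc_scores(lda, X_unknown)
# Indexes of unknown documents that score below every training document
attacks = np.flatnonzero(unknown_scores < threshold)
print(threshold, attacks)
```

Note that score() returns a total over the words of a document rather than a per-word value, so longer documents tend to score lower. Dividing each score by the document's word count is one possible adjustment, but that goes beyond what is suggested above.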

Exactly what you asked for

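If you want to use the exact equations from the paper you linked, I would advise against trying to do that in scikit-learn, because the interface to the expectation step is not clear.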

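The parameters θ and φ can be found on lines 112 to 130 of the source code as doc_topic_d and norm_phi. The function _update_doc_distribution() returns doc_topic_distribution and the sufficient statistics, from which you can try to infer θ and φ with the following, again untested, code: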

theta = doc_topic_d / doc_topic_d.sum()
# see the variables exp_doc_topic_d in the source code
# in the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS
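
If relying on private internals is undesirable, a rough alternative is to read point estimates of θ and φ from the public API: transform() returns the normalized document-topic distribution, and components_ gives the topic-word distribution once each row is normalized. The sketch below then computes a plug-in log-likelihood per document. This is an approximation under those assumptions, not the exact equation from the paper and not the variational bound that score() computes; the random data and n_components=5 are illustrative only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Hypothetical document-term counts: 40 docs, 30-word vocabulary
X = rng.poisson(1.0, size=(40, 30))

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# theta: per-document topic proportions (each row sums to 1)
theta = lda.transform(X)

# phi: per-topic word distributions, obtained by normalizing components_
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Probability of each word under each document's topic mixture
word_probs = theta @ phi  # shape (n_docs, n_words)

# Plug-in log-likelihood: sum of count * log p(word | doc) over the vocabulary
doc_loglik = (X * np.log(word_probs)).sum(axis=1)
print(doc_loglik[:5])
```

The resulting doc_loglik values could be thresholded the same way as the score() values, with the caveat that the two quantities are not numerically comparable.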

Suggestions for other libraries

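If you want more control over the expectation and maximization steps and over the variational parameters, I suggest you take a look at LDA++, and in particular at EStepInterface (disclaimer: I am one of the authors of LDA++).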
