Problem creating vectors from text in Mahout

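I am using Mahout 0.9 (installed on HDP 2.2) for topic discovery (the Latent Dirichlet Allocation algorithm). My text files are stored in the directory inputraw, and I run the following commands in order.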

Command #1:

mahout seqdirectory -i inputraw -o output-directory -c UTF-8

Command #2:

mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv

Command #3:

mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int

Command #4:

mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0

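After the second command runs, it creates, as expected, a number of files and subfolders under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). Given the size of my input files, the sizes of these outputs all look fine, except that the file under `tf-vectors` is very small (in fact only 118 bytes).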

Apparently, since `tf-vectors` is the input to the 3rd command, the third command also generates a small file. Does anyone know:

What is the reason for the file under the `tf-vectors` folder being that small? There must be something wrong.

Starting from the first command, all the generated files have a strange encoding and are not human-readable. Is this expected?

Answers to your questions:

What is the reason for the file under the `tf-vectors` folder being that small?

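The vectors are small because you set maxDFPercent to only 40. This means that only terms whose document frequency (the percentage of documents in which the term occurs) is below that threshold are taken into account. In other words, only terms that occur in 40% or fewer of the documents are considered when the vectors are generated.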
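One way to check this explanation is to re-run command #2 with a looser threshold and compare the size of the resulting `tf-vectors` output. The following is only a sketch: the value 99 and the output directory name `output-vector-str-df99` are illustrative assumptions, not part of the original post, and all other flags are copied from command #2.

```shell
# Illustrative threshold (assumption, not from the original post).
MAX_DF_PERCENT=99
# Hypothetical output directory, so the earlier result is not overwritten.
OUT_DIR=output-vector-str-df99

if command -v mahout >/dev/null 2>&1; then
  # Same flags as command #2, with only the threshold and output path changed.
  mahout seq2sparse -i output-directory -o "$OUT_DIR" -wt tf -ng 3 \
    --maxDFPercent "$MAX_DF_PERCENT" -ow -nv
  # List the new tf-vectors files to compare their size with the 118-byte one.
  hadoop fs -ls "$OUT_DIR/tf-vectors"
else
  echo "mahout not found on PATH; skipping" >&2
fi
```

If the new file is still only about a hundred bytes, the threshold is probably not the only factor, and inspecting the tokenized documents would be a reasonable next step.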

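Starting from the first command, all the generated files have a strange encoding and are not human-readable. Is this expected?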

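Yes. Mahout writes its output as Hadoop SequenceFiles, which are binary. Mahout has a command called `mahout seqdumper` that dumps a file in SequenceFile format into human-readable form. Good luck.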
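As a rough sketch, the vectors from command #2 could be inspected like this. The file name `part-r-00000` is an assumption about what the job wrote under `tf-vectors`; substitute whatever file is actually listed there.

```shell
# Assumed file name under tf-vectors; check the real name with a directory listing first.
VECTORS=output-vector-str/tf-vectors/part-r-00000

if command -v mahout >/dev/null 2>&1; then
  # Without an output option the dump goes to stdout; show only the first lines.
  mahout seqdumper -i "$VECTORS" | head -n 20
else
  echo "mahout not found on PATH; skipping" >&2
fi
```

The same command can be pointed at `output-vector-str/dictionary.file-0` to see which terms survived the filtering.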
