Mallet: Tokenization by N-grams (1,2)

Updated: 2023-09-23

I would like to know whether Mallet can tokenize text into n-grams of size 1 to 2, that is, unigrams and bigrams.

Here is the code I am using so far:

bin\mallet import-dir --input sample-data\web\en --output sample.txt --keep-sequence-bigrams --remove-stopwords
bin\mallet train-topics --input sample.txt --num-topics 20 --optimize-interval 10 --output-doc-topics sample_composition.txt --output-topic-keys sample_keys.txt

Thanks in advance.

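The topic model trainer does not use the bigram features; supporting them would make the code much more complicated. There are two ways to add bigrams. One is to modify the input data files before importing them, so that, for example,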

the cat sat

would become

the cat sat the_cat cat_sat
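
The transformation above could be sketched in Python as a preprocessing step run before import-dir. This is only a minimal illustration: the names append_bigrams and rewrite_dir are hypothetical, tokens are assumed to be separated by whitespace, and the input is assumed to be a directory of UTF-8 .txt files. None of this is part of Mallet.

```python
from pathlib import Path


def append_bigrams(text: str, sep: str = "_") -> str:
    """Return the text with each adjacent word pair appended as one token.

    Assumes whitespace tokenization; punctuation stays attached to words.
    """
    tokens = text.split()
    # Pair every token with its successor: (t0, t1), (t1, t2), ...
    bigrams = [f"{a}{sep}{b}" for a, b in zip(tokens, tokens[1:])]
    return " ".join(tokens + bigrams)


def rewrite_dir(src: str, dst: str) -> None:
    """Write a bigram-augmented copy of every .txt file in src into dst.

    Hypothetical helper: the resulting dst directory would then be passed
    to import-dir as --input instead of the original directory.
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        (out_dir / path.name).write_text(append_bigrams(text), encoding="utf-8")


if __name__ == "__main__":
    print(append_bigrams("the cat sat"))
```

Because the bigrams are joined with an underscore, it is worth checking that the token pattern used at import time keeps tokens such as the_cat intact. Stopword handling is a further choice: removing stopwords before building bigrams avoids pairs such as the_cat, at the cost of joining words that were not strictly adjacent in the original text.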

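You can also create a post-hoc report that identifies pairs of words that frequently occur together and are assigned to the same topic, using the train-topics option --xml-topic-phrase-report FILENAME.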
