How to get the most common phrases or words in Python or R

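Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen ways to get them for one n at a time, such as 3-grams or 2-grams, but is there a way to extract the longest phrases that are most meaningful, and then do the same for whatever is left?

For example, in this text, used only for demonstration: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.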

The ideal result, n-grams with their counts, would be:

fri evening commute: 3,
off-peak: 2,
rest of the words: 1

Any suggestions would be appreciated. Thanks.

python

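Consider the NLTK library, which provides an ngrams function that you can use while iterating over the values of n.

A rough implementation would be along the following lines, where rough is the key word here: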

from nltk import ngrams
from collections import Counter

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Periods are dropped and hyphens become spaces, so 'off-peak' is treated as the phrase 'off peak'
sentence = sentence.replace('.', '').replace('-', ' ')
# Try the longest possible n-gram first and work down to bigrams
for n in range(len(sentence.split(' ')), 1, -1):
    phrases = []
    for token in ngrams(sentence.split(), n):
        phrases.append(' '.join(token))
    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        # Note: this stores n (the phrase length in words), not freq (the occurrence count)
        result.append((phrase, n))
        # Remove the phrase so that shorter n-grams do not count it again
        sentence = sentence.replace(phrase, '')
for phrase, n in result:
    print('%s: %d' % (phrase, n))
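
If you would rather report how often each phrase occurs and avoid the NLTK dependency, here is a minimal sketch of the same longest-first idea using only the standard library. The names count_ngrams and repeated_phrases are made up for this example, and it assumes that periods are simply dropped, that hyphenated words such as off-peak stay a single token, and that n is capped at 6 as in the question. It works on a token list and masks claimed phrases with None instead of calling str.replace, so a phrase cannot match inside a longer word.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count runs of n consecutive tokens, skipping masked (None) slots."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(g for g in grams if None not in g)


def repeated_phrases(text, max_n=6):
    """Greedy sketch: claim repeated n-grams from longest to shortest."""
    # Assumption: periods are dropped and hyphenated words stay one token.
    tokens = text.replace('.', ' ').split()
    result = {}
    for n in range(max_n, 0, -1):
        while True:
            top = count_ngrams(tokens, n).most_common(1)
            if not top or top[0][1] < 2:
                break  # nothing of this length repeats any more
            gram = top[0][0]
            hits = 0
            i = 0
            while i <= len(tokens) - n:
                if tuple(tokens[i:i + n]) == gram:
                    # Mask the match so shorter n-grams cannot reuse it.
                    tokens[i:i + n] = [None] * n
                    hits += 1
                    i += n
                else:
                    i += 1
            result[' '.join(gram)] = hits
    # Every token still unmasked occurs exactly once.
    for word in tokens:
        if word is not None:
            result[word] = 1
    return result


text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
for phrase, count in repeated_phrases(text).items():
    print('%s: %d' % (phrase, count))
```

On the sample text this reports fri evening commute with a count of 2, because the phrase occurs twice, followed by off-peak with 2 and every remaining word with 1.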

As for R, this might be useful.

I suggest using udpipe in R: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html
