How to ignore certain words when counting word frequencies in a text

When counting how often each word occurs in a text, how can I ignore words such as 'a' and 'the'?

import collections

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
# Tokenize the raw string; str(df['phrase']) would tokenize the Series repr instead.
f = CountVectorizer().build_tokenizer()(df['phrase'].iloc[0])
result = collections.Counter(f).most_common(1)
print(result)

The result is 'The', but I want 'distance' to come out as the most frequent word.

It is best to avoid counting those entries in the first place, like this:

ignore = {'the','a','if','in','it','of','or'}
# The tokenizer keeps the original case, so lowercase each token before the lookup.
result = collections.Counter(x for x in f if x.lower() not in ignore).most_common(1)

Another option is to use the stop_words parameter of CountVectorizer.
These are the words you are not interested in; the analyzer discards them.

f = CountVectorizer(stop_words=['the','a','if','in','it','of','or']).build_analyzer()(df['phrase'].iloc[0])
result = collections.Counter(f).most_common(1)
print(result)
[('distance', 2)]

Note that the tokenizer does not perform preprocessing (lowercasing, accent stripping) or stop-word removal, so you need to use the analyzer here.
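
To make the difference concrete, here is a minimal sketch that runs both callables on the same sentence. The variable names (`text`, `tokens`, `analyzed`) and the two-word stop list are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "The large distance between cities. The small distance. The"

# Illustrative two-word stop list; any list of lowercase words works here.
vectorizer = CountVectorizer(stop_words=["the", "a"])

# build_tokenizer() only splits the text into tokens of two or more
# word characters: case is preserved and stop words are kept.
tokens = vectorizer.build_tokenizer()(text)

# build_analyzer() lowercases the text, tokenizes it, and then drops
# the stop words.
analyzed = vectorizer.build_analyzer()(text)

print(tokens)
print(analyzed)
```

The first list still contains 'The' three times, with its capital letter, while the second contains only lowercase tokens and no 'the'.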

You can also use stop_words='english' to remove English stop words automatically (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).
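
As a rough sketch of that variant, the snippet below counts the same sentence with the built-in English list and checks that 'the' is a member of ENGLISH_STOP_WORDS. The names `text`, `analyzer`, and `counts` are illustrative only.

```python
import collections

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = "The large distance between cities. The small distance. The"

# stop_words="english" selects scikit-learn's built-in English stop list.
analyzer = CountVectorizer(stop_words="english").build_analyzer()

# Count only the tokens that survive lowercasing and stop-word removal.
counts = collections.Counter(analyzer(text))
result = counts.most_common(1)

print(result)
print("the" in ENGLISH_STOP_WORDS)  # membership check against the built-in list
```

With this input, 'distance' is the only surviving token that occurs twice, so it is reported as the most common word.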
