Python 3 Counter that ignores strings shorter than x characters



I have a program that counts the words in a text file. Now I want to restrict the counter to strings longer than x characters.

from collections import Counter
input = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'
Counter = {}
words = {}
with open(input,'r', encoding='utf-8-sig') as fh:
    for line in fh:
        word_list = line.replace(',','').replace("'",'').replace('.','').lower().split()
        for word in word_list:
            if word not in Counter:
                Counter[word] = 1
            else:
                Counter[word] = Counter[word] + 1
N = 20
top_words = Counter(Counter).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))

I tried some re code, but it doesn't work.

re.sub(r'\b\w{1,3}\b')

I don't know how to implement it...

In the end, I want output that ignores all short words such as and, you, is, etc.

You can do this more simply with a length check:

for word in word_list:
    if len(word) < 5:   # skip words shorter than 5 characters, for example
        continue        # this skips the counting portion and jumps to the next word in word_list
    elif word not in Counter:
        Counter[word] = 1
    else:
        Counter[word] = Counter[word] + 1
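
If you do want to stay with a regex, as the question attempted, the idea can be made to work by matching the long words rather than deleting the short ones. This is only a minimal sketch: it swaps re.sub for re.findall, and the function name count_long_words and the parameter min_len are made up for illustration. Note that \w does not match an apostrophe, so a word like "don't" would be split into pieces.

```python
import re
from collections import Counter

def count_long_words(text, min_len=4):
    # \b\w{min_len,}\b matches whole words of at least min_len characters
    pattern = re.compile(r"\b\w{%d,}\b" % min_len)
    return Counter(pattern.findall(text.lower()))

counts = count_long_words("The cat and the climate; climate models, and you.")
print(counts.most_common(2))  # [('climate', 2), ('models', 1)]
```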

A few notes.

1) You imported Counter but did not use it correctly (you wrote Counter = {}, which overwrites the import).

from collections import Counter

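2) Instead of chaining several replace calls, use a list comprehension with a set. It is faster and iterates over the line only once (twice, counting the join) rather than once per replace: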

sentence = ''.join([char for char in line if char not in {'.', ',', "'"}])
word_list = sentence.split()

3) Use Counter with a comprehension that filters on word length:

c = Counter(word for word in word_list if len(word) > 3)

That's it.
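
Putting the three notes together, a complete version might look roughly like the sketch below. The function name count_words, the constant STRIP_CHARS, and the sample lines are invented for illustration; the punctuation set and the lower() call are taken from the question's code, and min_len=4 corresponds to len(word) > 3.

```python
from collections import Counter

STRIP_CHARS = {".", ",", "'"}  # same punctuation the question removes

def count_words(lines, min_len=4):
    """Count words of at least min_len characters across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # drop punctuation in a single pass over the line
        cleaned = "".join(ch for ch in line if ch not in STRIP_CHARS)
        # keep only the words that are long enough, then add them to the tally
        counts.update(w for w in cleaned.lower().split() if len(w) >= min_len)
    return counts

sample = ["Climate change is real, and it's here.", "The climate, the models."]
for word, freq in count_words(sample).most_common(3):
    print("%s %d" % (word, freq))
```

Because a file object iterates line by line, the handle from open(...) in the question can be passed in place of sample.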

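Counter already does what you want. You can "feed" it an iterable and it will work. https://docs.python.org/2/library/collections.html#counter-objects You can also use the filter function https://docs.python.org/3.7/library/functions.html#filter It could look something like this: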

counted = Counter(filter(lambda x: len(x) >= 5, words))
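
For a quick check of the line above, here is a small self-contained sketch. It assumes words is a flat list of word strings (the sample sentence is made up), not the empty dict from the question.

```python
from collections import Counter

# hypothetical sample data standing in for the words read from the file
words = "the quick brown foxes jumped over the lazy brown dogs".split()

# filter() lazily yields only the words with 5 or more characters
counted = Counter(filter(lambda w: len(w) >= 5, words))
print(counted.most_common(1))  # [('brown', 2)]
```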
