Create a Python dictionary from a text file and retrieve the count of each word
I am trying to create a dictionary of words from a text file, count the occurrences of each word, and then be able to look a word up in the dictionary and get its count, but I'm stuck. My biggest difficulty is lowercasing the words from the text file and stripping their punctuation, because otherwise my counts come out wrong. Any suggestions?

import string

f = open(r"C:\Users\Mark\Desktop\jefferson.txt", "r")
wc = {}
words = f.read().split()
count = 0
i = 0
for line in f:
    count += len(line.split())
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
for w in words:
    if w not in wc:
        wc[w] = 1
    else:
        wc[w] += 1
print wc['states']

This sounds like a job for collections.Counter:

import collections
with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())
print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]

Of course, this counts strings like "Liberty," (note the punctuation stuck to the word) as words, and it treats "The" and "the" as distinct. Also, reading the whole file at once can be wasteful for very large files.

Here is a version that ignores punctuation and case, and is more memory-efficient on large files:

import collections
import re
with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))
print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]

References:

  • https://docs.python.org/2/library/re.html
  • https://docs.python.org/2/library/collections.html#collections.Counter
  • Extracting whole words

A few points:

In Python, always use the following construct to read files:

 with open('ls;df', 'r') as f:
     # rest of the statements

If you use f.read().split(), it reads all the way to the end of the file. After that, you need to go back to the beginning:

f.seek(0)
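To see why the rewind matters, here is a minimal sketch using io.StringIO as a stand-in for an open file (the sentence is made up for illustration):

```python
import io

f = io.StringIO("we hold these truths\nto be self-evident\n")
words = f.read().split()          # read() consumes the stream to the end
assert sum(1 for _ in f) == 0     # a second pass over f now sees nothing
f.seek(0)                         # rewind to the start
count = sum(len(line.split()) for line in f)
print(count)  # 7
```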

Third, the part where you do:

for w in words: 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
        i += 1 
    else: 
        i += 1 
        print words

You don't need to keep a counter yourself in Python. You can simply do...

for i, w in enumerate(words): 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
    else: 
        print words

However, you don't even need the i < count check here... you can simply do:

words = [w.translate(None, string.punctuation).lower() for w in words]
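Note that the two-argument str.translate above is Python 2 only. In Python 3 the equivalent sketch builds a translation table with str.maketrans (the sample word list is made up):

```python
import string

words = ['Hello,', 'world!', 'STATES.']
table = str.maketrans('', '', string.punctuation)  # delete every punctuation character
words = [w.translate(table).lower() for w in words]
print(words)  # ['hello', 'world', 'states']
```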

Finally, if you only want to count states rather than build a dictionary of every word, consider using filter...

print len(filter( lambda m: m == 'states', words ))
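On Python 3, filter returns an iterator, so you would need to wrap it in list() before taking len(); a list's built-in count method sidesteps the issue entirely (word list below is hypothetical):

```python
words = ['we', 'the', 'people', 'of', 'the', 'united', 'states']

# Python 3 equivalent of the filter approach
print(len(list(filter(lambda m: m == 'states', words))))  # 1

# simpler: list.count
print(words.count('states'))  # 1
```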

One last thing...

If the file is large, it is not advisable to hold every word in memory at once. Consider updating the wc dictionary line by line. You could do:

for line in f: 
    words = line.split()
    # rest of your code
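Putting the pieces together, here is one memory-friendly sketch (Python 3 syntax; file name and sample text are made up) that updates a Counter one line at a time, stripping punctuation and case as it goes:

```python
import collections
import io
import string

def count_words(lines):
    """Update a Counter one line at a time; only one line is in memory."""
    table = str.maketrans('', '', string.punctuation)
    wc = collections.Counter()
    for line in lines:
        wc.update(w.translate(table).lower() for w in line.split())
    return wc

# works on any iterable of lines: an open file, or here a stand-in stream
wc = count_words(io.StringIO("The united States,\nof the STATES.\n"))
print(wc['states'], wc['the'])  # 2 2
```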

File_Name = 'file.txt'
counterDict={}
with open(File_Name,'r') as fh:
    for line in fh:
        # remove the punctuation
        words = line.replace('.', '').replace("'", '').replace(',', '').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1
print('Count of the word > common< :: ',  counterDict.get('common',0))
