[using Python 3.3.3]
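I'm trying to analyse text files, clean them up, print the number of unique words, and then save the list of unique words to a text file, one word per line, together with the number of times each unique word appears in the cleaned-up list of words. So what I did was take the text file (a speech from Prime Minister Harper), clean it up by keeping only valid alphabetical characters and single spaces, and count the number of unique words. Now I need to save a text file of the unique words, with each unique word on its own line and, beside the word, the number of occurrences of that word in the cleaned-up list. Here's what I have.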
def uniqueFrequency(newWords):
    '''Function returns a list of unique words with amount of occurrences of that
    word in the text file.'''
    unique = sorted(set(newWords.split()))
    for i in unique:
        unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"
    return unique

def saveUniqueList(uniqueLines, filename):
    '''Function saves result of uniqueFrequency into a text file.'''
    outFile = open(filename, "w")
    outFile.write(uniqueLines)
    outFile.close
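newWords is the cleaned-up version of the text file: just words and spaces, nothing else. So I want each unique word in newWords saved to a text file, one word per line, and beside each word the number of times that word appears in newWords (not in the unique words list, because there every word would appear exactly once). What is wrong with my functions? Thank you!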
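unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"
The line above appends to the end of the existing collection, unique, so the result begins with the string form of the whole list. If you use a different variable name instead, such as var, it should return the right text.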
def uniqueFrequency(newWords):
    '''Function returns a list of unique words with amount of occurrences of that
    word in the text file.'''
    var = ""
    unique = sorted(set(newWords.split()))
    for i in unique:
        var = var + i + " " + str(newWords.count(i)) + "\n"
    return var
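Separately from the accumulation bug, saveUniqueList in the question ends with outFile.close, without parentheses. That only refers to the method and never calls it, so the file is not explicitly closed. Here is a minimal sketch of one way to avoid that, using a with block; the demo path and sample text are assumptions made up for illustration:

```python
import os
import tempfile

def saveUniqueList(uniqueLines, filename):
    '''Save the string produced by uniqueFrequency to a text file.'''
    # The with block closes the file on exit, even if write() raises.
    with open(filename, "w") as outFile:
        outFile.write(uniqueLines)

# Hypothetical demo path in the temp directory, just for illustration.
demo_path = os.path.join(tempfile.gettempdir(), "unique_demo.txt")
saveUniqueList("a 2\nb 1\n", demo_path)

# Read the file back to check what was written.
with open(demo_path) as f:
    saved = f.read()
```

If you prefer to keep the explicit open/close style, writing outFile.close() with the parentheses should also work.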
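Based on
unique = sorted(set(newWords.split()))
for i in unique:
    unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"
my guess is that newWords is not a list of strings but one long string. If so, newWords.count(i) counts substring matches rather than whole words, so a short word is also counted inside every longer word that contains it.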
Try:
def uniqueFrequency(newWords):
    '''Function returns a list of unique words with amount of occurrences of that
    word in the text file.'''
    wordList = newWords.split()
    unique = sorted(set(wordList))
    ret = ""
    for i in unique:
        ret = ret + i + " " + str(wordList.count(i)) + "\n"
    return ret
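To see why counting on the split list matters: str.count counts substring matches, while list.count compares whole elements. A small sketch with a made-up sentence (the sentence and the variable names substring_count and word_count are only for illustration):

```python
newWords = "the cat sat on the mat then the theme ended"
wordList = newWords.split()

# str.count counts non-overlapping substring matches, so "the" is also
# found inside "then" and "theme".
substring_count = newWords.count("the")

# list.count only counts elements equal to the whole word.
word_count = wordList.count("the")

print(substring_count, word_count)
```

The two numbers differ whenever one word occurs inside another, which is common in ordinary English text.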
Try collections.Counter. It is designed for exactly this situation.
An IPython demo follows:
In [1]: from collections import Counter
In [2]: txt = """I'm trying to analyse text files, clean them up, print the amount of unique words, then try to save the unique words list to a text file, one word per line with the amount of times each unique word appears in the cleaned up list of words. SO what i did was i took the text file (a speech from prime minister harper), cleaned it up by only counting valid alphabetical characters and single spaces, then i counted the amount of unique words, then i needed to make a saved text file of the unique words, with each unique word being on its own line and beside the word, the number of occurances of that word in the cleaned up list. Here's what i have."""
In [3]: Counter(txt.split())
Out[3]: Counter({'the': 10, 'of': 7, 'unique': 6, 'i': 5, 'to': 4, 'text': 4, 'word': 4, 'then': 3, 'cleaned': 3, 'up': 3, 'amount': 3, 'words,': 3, 'a': 2, 'with': 2, 'file': 2, 'in': 2, 'line': 2, 'list': 2, 'and': 2, 'each': 2, 'what': 2, 'did': 1, 'took': 1, 'from': 1, 'words.': 1, '(a': 1, 'only': 1, 'harper),': 1, 'was': 1, 'analyse': 1, 'one': 1, 'number': 1, 'them': 1, 'appears': 1, 'it': 1, 'have.': 1, 'characters': 1, 'counted': 1, 'list.': 1, 'its': 1, "I'm": 1, 'own': 1, 'by': 1, 'save': 1, 'spaces,': 1, 'being': 1, 'clean': 1, 'occurances': 1, 'alphabetical': 1, 'files,': 1, 'counting': 1, 'needed': 1, 'that': 1, 'make': 1, "Here's": 1, 'times': 1, 'print': 1, 'up,': 1, 'beside': 1, 'trying': 1, 'on': 1, 'try': 1, 'valid': 1, 'per': 1, 'minister': 1, 'file,': 1, 'saved': 1, 'single': 1, 'words': 1, 'SO': 1, 'prime': 1, 'speech': 1, 'word,': 1})
(Note that this solution isn't perfect yet; it doesn't strip the commas from the words. Hint: use str.replace.)
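One possible way to follow that hint is sketched below. The sample text, the set of punctuation characters, and the lowercasing step are all assumptions for illustration, not part of the original demo:

```python
from collections import Counter

# Made-up sample text with punctuation stuck to some words.
txt = "words, then more words. Then words, again"

# Replace each unwanted punctuation character with nothing.
# The character set here is an assumption; extend it as needed.
cleaned = txt
for ch in ",.()":
    cleaned = cleaned.replace(ch, "")

# Lowercasing is optional; it makes "Then" and "then" count as one word.
cnts = Counter(cleaned.lower().split())
```

With the punctuation gone, 'words,' and 'words.' no longer show up as separate keys.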
Counter is a specialized dict, with the words as keys and the counts as values. So you can use it like this:
cnts = Counter(txt.split())
with open('counts.txt', 'w') as outfile:
    for c in cnts:
        outfile.write("{} {}\n".format(c, cnts[c]))
Note that in this solution I've used some well-understood Python concepts:
- context managers
- iterating over a dict (which is an iterable)
- str.format
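Putting the Counter idea back into the shape of the question, uniqueFrequency could be sketched roughly as follows. The alphabetical ordering and the "word count" line format are assumptions based on what the question describes:

```python
from collections import Counter

def uniqueFrequency(newWords):
    '''Return one "word count" line per unique word, sorted alphabetically.'''
    cnts = Counter(newWords.split())
    # sorted() on a dict yields its keys in order; cnts[word] is the count.
    lines = ["{} {}".format(word, cnts[word]) for word in sorted(cnts)]
    # A trailing newline keeps every word on its own line in the file.
    return ("\n".join(lines) + "\n") if lines else ""

result = uniqueFrequency("b a b c a b")
print(result)
```

The returned string can then be handed to a function like the question's saveUniqueList to write it out.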