Hi, so I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists containing each word from the file and its count.
My second text file contains keywords, and I need to count the frequency of those keywords in the first text file and return the result without using any imports, dicts, or zips.
I'm stuck on how to do the second part. I have the file opened and the punctuation removed etc., but I have no idea how to find the frequency. I played around with the idea of .find() but no luck so far.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keywords file rather than in the first text file:
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList;
EDIT - currently toying with the idea of reopening my first file again, but this time without turning it into a list? I think my error somehow comes from counting the frequency of my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
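For reference, that output pattern is exactly what happens when a counting routine like the one above is fed a list of [word, count] pairs instead of a flat word list: every pair is unique, so each one gets a count of 1. A minimal sketch (simplified counting logic, hypothetical data) that reproduces the symptom:

```python
def count_items(items):
    # Same counting scheme as calculateFrequenciesTest, simplified:
    # count the first item, record [item, count], drop its occurrences.
    items = list(items)
    counts = []
    while items:
        target = items[0]
        n = items.count(target)
        counts.append([target, n])
        items = [x for x in items if x != target]
    return counts

# Fed a flat word list, it behaves as intended:
print(count_items(['the', 'of', 'the']))
# [['the', 2], ['of', 1]]

# Fed a list of [word, count] pairs, every pair is unique,
# so each gets a count of 1 -- matching the output above:
print(count_items([['the', 66], ['of', 32]]))
# [[['the', 66], 1], [['of', 32], 1]]
```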
You can try something like this.
I'll take a word list as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have a constraint on dictionaries, I have made use of two lists to do the same task. I'm not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change this to the way you like or refactor it as you wish.
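As one possible refactoring of the two-list approach above, it can be wrapped into a function that returns the [word, count] pairs the question asks for (a sketch; the function name is my own):

```python
def count_words(word_list):
    # Parallel lists: words[i] holds a word, counts[i] its frequency.
    words = []
    counts = []
    for word in word_list:
        if word not in words:
            words.append(word)
            counts.append(1)
        else:
            counts[words.index(word)] += 1
    # Pair the two lists up without zip(), per the question's constraints.
    return [[words[i], counts[i]] for i in range(len(words))]

print(count_words(['hello', 'world', 'test', 'hello']))
# [['hello', 2], ['world', 1], ['test', 1]]
```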
I agree with @bereal that you should use Counter for this. I see that you said that you don't want "imports, dicts or zips", so feel free to disregard this answer. Yet one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I get the impression that you want to use the same style that you would use with C or Java. I suggest trying to be a bit more pythonic. Code written this way may look unfamiliar, and it can take time to get used to it. Yet you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dicts or zips?
So here's a proposal utilizing built-in functionality (no third party) (tested with Python 2):
#!/usr/bin/python

import re          # String matching
import collections # collections.Counter basically solves your problem


def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()


def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())


# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 3

# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # Returns ['a', 'and', 'the', 'of']
Here's a solution with no imports. It uses nested linear searches, which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, and it handles it in a reasonable time. I suspect if your keywords file were larger (mine has only 3 words) the slowdown would start to show.
Here we take an input file, iterate over the lines and remove punctuation, then split by spaces and flatten all the words into a single list. The list has duplicates, so to remove them we sort the list so the duplicates come together, then iterate over it, creating a new list containing the string and a count. We can do this by increasing the count as long the same word appears in the list, and moving to a new entry when a new word is seen.
Now you have your word frequency list, and you can search it for the keywords you need and retrieve the counts.
The input text file is here, and the keywords file can be cobbled together with just a few words from the file, one per line.
Python 3 code; it indicates where to modify for Python 2 where applicable.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
You might be tempted to modify findword to use a binary search, but it still won't come close to a dict. When there are no restrictions, collections.Counter is the right solution. It's quicker and less code.
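For what it's worth, since the no-imports constraint also rules out bisect, a binary-search version of findword would have to be hand-rolled. A sketch (my own function name), assuming counts is sorted by word, as words.sort() above guarantees:

```python
def findword_bisect(clist, word):
    # Binary search over [word, count] pairs sorted by word:
    # narrow [lo, hi) to the first entry not less than word.
    lo, hi = 0, len(clist)
    while lo < hi:
        mid = (lo + hi) // 2
        if clist[mid][0] < word:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(clist) and clist[lo][0] == word:
        return clist[lo][1]
    return 0

counts = [['and', 27], ['of', 32], ['the', 66]]  # sorted by word
print(findword_bisect(counts, 'of'))   # 32
print(findword_bisect(counts, 'cat'))  # 0 (not present)
```

This makes each lookup O(log n) instead of O(n), but as noted above, a dict lookup is still faster in practice.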