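I'm working through an NLP exercise and would like some advice on the best way to get my result. I have two text files: one is a word list (a kind of vocabulary) and the other is an article. I need to count how often each word from the word list appears in the article.
I'm trying to do this one step at a time so that I improve my skills.
I've read in both texts and tokenized/split them into words, and I've now put the article's words into a dictionary of counts.
My next step (I assume) is to find the intersection of that dictionary and the word list, and return the frequency of each word-list entry that occurs in my article.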
wordlist = terms.split()
splittext = input_article.split()
freq = {}
for term in splittext:
    if term in freq:
        freq[term] += 1
    else:
        freq[term] = 1
#print(freq)
result = {i for i in wordlist if i in freq.keys()}
print(result)
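This is what I have so far, but the last line is where I'm stuck. Every word of the article is now in the dictionary with its count. What I want to return is the frequency of each word-list entry in the input article.
Any suggestions on how to do this?
As I understand it, this should work: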
text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum"
key = "? Lorem Ipsum more was not the with 123 test notin desktop"
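# Named `counts` and `result` rather than `dict`, which would shadow the built-in type.
counts = {}   # word -> number of occurrences in the text
result = {}   # key word -> its count, for key words found in the text
words = text.split(" ")
keys = key.split(" ")
# Count every word in the text.
for word in words:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1
# Look up each key word and keep the ones that occur in the text.
for k in keys:
    if k in counts:
        print("Key: {} freq: {}".format(k, counts[k]))
        result[k] = counts[k]
print(result)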
Output (from the final print):
{'Lorem': 4, 'Ipsum': 4, 'more': 1, 'was': 1, 'not': 1, 'the': 6, 'with': 2, 'desktop': 1}
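Note that splitting on whitespace alone leaves punctuation attached to words (for example "industry." is counted separately from "industry"), and the lookup is case-sensitive. If that matters for the exercise, a variant along the following lines might help. It is only a sketch: the function name `term_frequencies`, the regular expression used for tokenizing, and the choice to report missing words with a count of 0 are my own assumptions, not something the question requires.

```python
import re
from collections import Counter

def term_frequencies(article, terms):
    """Count how often each vocabulary term occurs in the article (hypothetical helper)."""
    # Assumption: lowercase everything and keep only runs of letters, digits
    # and apostrophes, so "Industry." and "industry" count as the same word.
    tokens = re.findall(r"[a-z0-9']+", article.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so absent terms are reported too.
    return {term: counts[term.lower()] for term in terms}

article = "Lorem Ipsum is simply dummy text. Lorem Ipsum has been the standard dummy text."
wordlist = "Lorem dummy the notin".split()
result = term_frequencies(article, wordlist)
print(result)  # {'Lorem': 2, 'dummy': 2, 'the': 1, 'notin': 0}
```

To keep only the words that actually occur, as in the answer above, the zero entries can be filtered out afterwards, e.g. `{t: n for t, n in result.items() if n > 0}`.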