Creating dicts from a large corpus

I have a corpus of 30,000 messages.

corpus = [
"hello world", 
"i like mars", 
"a planet called venus", 
... , 
"it's all pcj500"]

I have tokenized them and built a word_set containing all of the unique words.

word_lists = [text.split(" ") for text in corpus]
>>> [['hello', 'world'],
['i', 'like', 'mars'],
['a', 'planet', 'called', 'venus'],
...,
["it's", 'all', 'pcj500']]
word_set = set().union(*word_lists)
>>> {'hello', 'world', 'i', 'like', ..., 'pcj500'}

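I am trying to create a list of dictionaries, one per message, where:
1. every word in word_set is a key with an initial count of 0, and
2. if a word in word_set appears in a word_list in word_lists, its value is the corresponding count.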

For step 1, I am doing it like this:

tmp = corpus[:10]
word_dicts = []
for i in range(len(tmp)):
    word_dicts.append(dict.fromkeys(list(word_set)[:30], 0))
word_dicts
>>> [{'hello': 0,
'world': 0,
'mars': 0,
'venus': 0,
'explore': 0,
'space': 0}]

Question:

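How can I perform the dict.fromkeys operation for every text in the corpus, using all of the items in word_set? When I try it on the whole corpus, I run out of memory. There should be a better way to do this, but I can't find it myself.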

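You can use defaultdict or Counter from collections, which create keys lazily: a word is stored only once it has actually been counted, so each dictionary holds just the words in its own message instead of a zero for every word in word_set. With a Counter, looking up a word that is not stored returns 0. Example: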

from collections import Counter
word_dicts = []
for words_list in word_lists:
    word_dicts.append(Counter(words_list))
