Counting words and features within a window



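I have a somewhat complex counting problem. I'm trying to build a co-occurrence dataframe with one row per word in the corpus, and columns for those same words plus a set of features (e.g. plural, singular, past tense, etc.).

I have already built a related dictionary. Each word maps to an inner dictionary in which every key is either a word or a feature, like this: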

WordDict={Word1 :{word1:0, word2:0 ... feature1:0, feature2:0 ...}, Word2 :{word1:0, word2:0 ... feature1:0, feature2:0 ...} ...}
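
A minimal sketch of how such a zero-initialised nested dictionary could be built. The names `build_word_dict`, `vocab`, and `features` are hypothetical and do not appear in the original; the word and feature values are placeholders.

```python
def build_word_dict(vocab, features):
    """Return {word: {column: 0}} with one column per word and per feature."""
    columns = list(vocab) + list(features)
    # Each word gets its own inner dict, so counts are never shared between rows.
    return {word: dict.fromkeys(columns, 0) for word in vocab}


# Hypothetical placeholder vocabulary and feature names.
vocab = ["word1", "word2", "word3"]
features = ["feature1", "feature2"]
WordDict = build_word_dict(vocab, features)
print(WordDict["word1"])
```

Building a separate inner dictionary per word matters here: reusing one shared inner dictionary would make an increment on one row show up in every row.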

I also have a (lemmatized) corpus of words:

doc=['Word1', 'Word2', 'Word3' ...]

I also have a list of lists containing the tokens and their features:

meh=[['Word1', 'Feature1', 'Feature2', 'Feature3'], ['Word2', 'Feature1', 'Feature2', 'Feature3', 'Feature4' ], ['Word3', 'Feature1', 'Feature3']]

Ideally, I would like the dictionary to end up looking like this:

WordDict={Word1:{word1:0, word2:1 ... feature1:1, feature2:1 ...}, Word2:{word1:1, word2:0 ... feature1:1, feature2:1 ...} ...}

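Because the words are lemmas, some words will be repeated in doc but have only one entry in WordDict. Essentially, I need to:

  1. For each top-level key in WordDict, loop over meh.

    1a. For each feature observed in the meh list for that top-level key, add 1 to the corresponding feature count in WordDict.

  2. For each top-level key in WordDict, loop over doc.

    2a. For each word seen within 5 positions to the left or right, add 1 to the corresponding word count in WordDict.

I have considered using some kind of n-gram window for this: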

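def windower(tokens, n):
    # Yield each token together with the window of up to n tokens on either side.
    for count, ele in enumerate(tokens):
        # Clamp the left edge at 0 so a negative start does not wrap around.
        start = max(0, count - n)
        # The slice end is exclusive, so add 1 to include the n-th token on the right.
        window = tokens[start:count + n + 1]
        yield ele, window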

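So I think that, to count word co-occurrences from here, I need a way to add the occurrences in window to the relevant word key in WordDict.

Hopefully someone can help?

I wrote the following code based on your description.

However, steps 2 and 2a feel odd to me. I don't think the code does exactly what you want.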

wordDict = {
    "word1": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word2": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word3": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
}
# some will be repeated you say?
doc = ["word1", "word1", "word2", "word3"]
meh = [["word1", "feature2", "feature3"], ["word2", "feature2"], ["word3", "feature1"]]
for word, wf in wordDict.items():
    # 1a starts
    found = False
    for m in meh:
        if m[0] == word:
            found = True
            for f in m[1:]:
                wf[f] += 1
        if found:
            break
    # 1a ends
    # 2a starts
    docLen = len(doc)
    for i, d in enumerate(doc):
        # 5 to the left, excluding itself
        for j in range(max(0, i - 5), i):
            wf[doc[j]] += 1
        # 5 to the right, excluding itself
        for j in range(i + 1, min(i + 6, docLen)):
            wf[doc[j]] += 1
    # 2a ends

print(wordDict)
# {'word1': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 1}, 'word2': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 0}, 'word3': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 1, 'feature2': 0, 'feature3': 0}}
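
If step 2a is instead meant to count, for each word, only the neighbours around that word's own occurrences (rather than adding every window to every key, as the code above does), one possible sketch follows. It reuses the sample `doc` from above; the function name `count_cooccurrences` and its `window` parameter are assumptions made for illustration, and `collections.Counter` stands in for the pre-filled `wordDict`.

```python
from collections import Counter, defaultdict


def count_cooccurrences(doc, window=5):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, center in enumerate(doc):
        # Up to `window` tokens on each side, excluding position i itself.
        left = doc[max(0, i - window):i]
        right = doc[i + 1:i + 1 + window]
        # Attribute the neighbours to the word at the centre of the window only.
        counts[center].update(left + right)
    return counts


doc = ["word1", "word1", "word2", "word3"]
cooc = count_cooccurrences(doc)
print(dict(cooc["word1"]))
# {'word1': 2, 'word2': 2, 'word3': 2}
```

Note that a repeated lemma counts as its own neighbour here: `word1` appears twice in a row, so it co-occurs with itself twice. Whether that is desired depends on how the co-occurrence table will be used.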
