Given the following:
basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."
and the following text:
words = "word, text, bank, tree"
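How can I compute the PMI value of each word in "words" against each word in "basis", using a context window of size 5 (that is, two positions before and two positions after the target word)?
I know how to compute PMI, but I don't know how to handle the context window.
I compute the "normal" PMI value as follows: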
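from math import log

def PMI(ContingencyTable):
    # cells of the 2x2 contingency table plus the grand total N
    (a, b, c, d, N) = ContingencyTable
    # add one to every cell to avoid log(0); N grows by 4 to match
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b  # first-row marginal
    C_1 = a + c  # first-column marginal
    return log(float(a) / (float(R_1) * float(C_1)) * float(N), 2)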
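I did a little searching on PMI, and it looks like there are heavy-duty packages out there, "windowing" included.
In PMI, the "mutual" seems to refer to the joint probability of two different words, so you need to firm up that idea with respect to your problem statement.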
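For reference, the standard definition of PMI written out, with the same quantity expressed in the names used by the question's `PMI` function. Reading `a` as the count of windows in which both words occur is an assumption here, as is the choice to estimate probabilities from window counts; the question's function additionally adds one to every cell before this step.

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)}
                   = \log_2 \frac{a / N}{(R_1 / N)(C_1 / N)}
                   = \log_2 \frac{a \, N}{R_1 \, C_1}
```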
I took on the smaller problem of just generating the short windowed lists from your problem statement, mostly for my own exercise:
def wndw(wrd_l, m_l, pre, post):
    """
    returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l
    wrd_l = list of words
    m_l = list of words to match on
    pre, post = ints giving range of indices to include in window size
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            # clip the window at both ends of the word list
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= (i + k) < len(wrd_l)])
    return wndw_l
basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""
words = "word, text, bank, tree"
print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
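To go from windows to PMI values, one possible approach is to count (center, context) pairs over every window and plug those counts into the PMI formula. The following is only a sketch under several assumptions that are not part of the question: the function name `window_pmi` and its `half_width` parameter are made up for illustration, tokens are lowercased with punctuation dropped, probabilities are estimated from pair counts, and no add-one smoothing is applied (unlike the question's `PMI` function), so only pairs that actually co-occur get a score.

```python
import re
from collections import Counter
from math import log2

def window_pmi(text, targets, half_width=2):
    """PMI for each (target, context) pair seen within +/- half_width tokens."""
    # Assumed tokenization: lowercase, keep only runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    pair_counts = Counter()     # (center, context) co-occurrence counts
    center_counts = Counter()   # number of pairs in which a word is the center
    context_counts = Counter()  # number of pairs in which a word is the context
    for i, center in enumerate(tokens):
        lo = max(0, i - half_width)
        hi = min(len(tokens), i + half_width + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the center word itself
            pair_counts[(center, tokens[j])] += 1
            center_counts[center] += 1
            context_counts[tokens[j]] += 1
    total = sum(pair_counts.values())
    scores = {}
    for (center, context), joint in pair_counts.items():
        if center in targets:
            # log2( p(x, y) / (p(x) * p(y)) ) with probabilities as count / total
            scores[(center, context)] = log2(
                joint * total / (center_counts[center] * context_counts[context]))
    return scores

basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""
words = "word, text, bank, tree"
targets = {w.strip() for w in words.split(",")}

for pair, score in sorted(window_pmi(basis, targets).items()):
    print(pair, round(score, 3))
```

Since "bank" and "tree" never occur in the basis text, they produce no pairs at all. With a text this short the scores are very noisy, so some form of smoothing, such as the add-one step in the question's function, may be worth adding.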