Calculating PMI values using a given context window



Given the following basis:

basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."

and the following words:

words = "word, text, bank, tree"

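How can I calculate the PMI value of each word in "words" against each word in "basis", using a context window of size 5 (that is, two positions before and two positions after the target word)?

I know how to calculate PMI, but I don't know how to take the context window into account.

I calculate the "normal" PMI value as follows: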

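from math import log

def PMI(ContingencyTable):
    # four cells of a 2x2 contingency table plus the total count N
    (a, b, c, d, N) = ContingencyTable
    # add-one smoothing to avoid log(0)
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b  # total of the first row
    C_1 = a + c  # total of the first column
    # log2(a * N / (R_1 * C_1))
    return log(float(a) / (float(R_1) * float(C_1)) * float(N), 2)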

I did a little searching on PMI, and it looks like there are heavy-duty packages out there, with "windowing" included.

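In PMI, the "mutual" seems to refer to the joint probability of two different words, so you need to firm up that idea in terms of your problem statement: decide what counts as two words occurring together (here, falling inside the same window).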
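
For reference, this is the usual definition as I understand it, written with the names from the question's function. I am assuming that a is the co-occurrence count, R_1 and C_1 are the row and column totals for the two words, and N is the total number of observations:

```latex
\mathrm{PMI}(x, y)
  = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
  = \log_2 \frac{a / N}{(R_1 / N)\,(C_1 / N)}
  = \log_2 \frac{a \, N}{R_1 \, C_1}
```

The last form is what the return line of the question's PMI function evaluates after smoothing.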

I took on a smaller problem, just generating the list of short windows from your problem statement, mainly for my own practice:

def wndw(wrd_l, m_l, pre, post):
    """
    returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l
    wrd_l      = list of words
    m_l        = list of words to match on
    pre, post  = ints giving range of indices to include in window size      
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= i + k < len(wrd_l)])
    return wndw_l
basis = """Each word of the text is converted as follows: move any
             consonant (or consonant cluster) that appears at the start
             of the word to the end, then append ay."""
words = "word, text, bank, tree"
print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
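
Going from windows like these to actual PMI values could look roughly like the sketch below. It is not the only way to do it, and it rests on my own assumptions: every token in the text is treated as a window center, each (center, context) pair inside a window counts as one observation, the text is lowercased with punctuation stripped, and no smoothing is applied, so pairs that never share a window get no score at all. The names cooccurrence_counts and window_pmi are made up for this example.

```python
import re
from collections import Counter
from math import log2

def cooccurrence_counts(tokens, pre=2, post=2):
    """Count (center, context) pairs, using every token as a window center."""
    pair, center, context = Counter(), Counter(), Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - pre), min(len(tokens), i + post + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            pair[(w, tokens[j])] += 1
            center[w] += 1
            context[tokens[j]] += 1
    return pair, center, context

def window_pmi(tokens, targets, pre=2, post=2):
    """PMI for each target word and each word seen inside its windows."""
    pair, center, context = cooccurrence_counts(tokens, pre, post)
    total = sum(pair.values())  # number of (center, context) observations
    scores = {}
    for (w, c), n in pair.items():
        if w in targets:
            # log2( p(w, c) / (p(w) * p(c)) ) with probabilities as count / total
            scores[(w, c)] = log2(n * total / (center[w] * context[c]))
    return scores

basis = """Each word of the text is converted as follows: move any
           consonant (or consonant cluster) that appears at the start
           of the word to the end, then append ay."""
words = "word, text, bank, tree"

# Assumption: lowercase and keep only alphabetic runs as tokens
tokens = re.findall(r"[a-z]+", basis.lower())
targets = {x.strip() for x in words.split(",")}

scores = window_pmi(tokens, targets)
for (w, c), value in sorted(scores.items()):
    print(w, c, round(value, 3))
```

Since "bank" and "tree" never occur in the basis text, they produce no scores here. If you want a finite value for unseen pairs, you could feed the counts into the smoothed PMI function from the question instead.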
