Need help understanding how zip(), *[...] and .update() work

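The program should read a file and then output the frequency of its N-grams. After some research I've figured out most of the code. The only parts I don't understand are combination= (zip(*[words[i:] for i in range(n)])) and c.update(combination). I know that zip returns a list of tuples, but I don't understand why there is a for loop inside its argument.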

from collections import Counter
filename = r'/Users/ma/desktop/dd.txt'
textfile = open(filename, 'r')
c = Counter()
def n_grams(n):
    for line in textfile:
        words = line.split()
        combination= (zip(*[words[i:] for i in range(n)]))
        c.update(combination)
    return c
n = int(raw_input('Enter the sequence of words.'))
m= n_grams(n)

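What zip does is join two (or more) lists together into a new list of tuples.
Taken directly from the zip documentation (this is Python 2 output; in Python 3 zip returns an iterator, so you would wrap it in list() to see the tuples):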

>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
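
For readers on Python 3, here is a minimal sketch of the same session. It assumes Python 3 semantics, where zip is lazy and can only be consumed once; the names pairs and leftover are just for illustration.

```python
x = [1, 2, 3]
y = [4, 5, 6]

zipped = zip(x, y)        # Python 3: a lazy iterator, not a list
pairs = list(zipped)      # consuming it produces the tuples
print(pairs)              # [(1, 4), (2, 5), (3, 6)]

leftover = list(zipped)   # the iterator is already exhausted
print(leftover)           # []
```

This matters for the question's code only if you want to inspect combination before passing it to c.update: printing or listing it first would use it up.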

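The result of *[words[i:] for i in range(n)] is a bit hard for me to visualize for you, because I don't know what your actual data is, only what it probably contains.

Here is an oversimplified version of the code (with n fixed at 3) to make it more readable: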

for line in textfile:
    words = line.split() # Splits on each <space>: 'my mom' will be ['my', 'mom']
    words_to_work_with = []
    for i in range(3):
        words_to_work_with.append(words[i:])
    combination=zip(*words_to_work_with)
    c.update(combination)

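It loops over each line in the text file and splits the line into parts (splitting on whitespace).
Then we take the words on that line starting from increasing offsets and add each slice to a list, like this: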

words_to_work_with = []
row = ['your', 'car', 'is', 'cooler', 'than', 'mine']
words_to_work_with.append(row[0:])
words_to_work_with.append(row[1:])
words_to_work_with.append(row[2:])
words_to_work_with == [['your', 'car', 'is', 'cooler', 'than', 'mine'], ['car', 'is', 'cooler', 'than', 'mine'], ['is', 'cooler', 'than', 'mine']]

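What the last part does is unpack the list called words_to_work_with by prepending *. This basically translates to:

zip(['your', 'car', 'is', 'cooler', 'than', 'mine'], ['car', 'is', 'cooler', 'than', 'mine'], ['is', 'cooler', 'than', 'mine'])

Instead of:

zip([['your', 'car', 'is', 'cooler', 'than', 'mine'], ['car', 'is', 'cooler', 'than', 'mine'], ['is', 'cooler', 'than', 'mine']])

Notice the difference? In the first case we pass 3 arguments; in the second we send just one big list as our only argument. zip needs multiple lists to join. The result is a new list of tuples, each holding three consecutive words in their original order, and zip stops as soon as the shortest slice runs out. It looks like this: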

>>> list(zip(['your', 'car', 'is', 'cooler', 'than', 'mine'], ['car', 'is', 'cooler', 'than', 'mine'], ['is', 'cooler', 'than', 'mine']))
[('your', 'car', 'is'), ('car', 'is', 'cooler'), ('is', 'cooler', 'than'), ('cooler', 'than', 'mine')]
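
To see why the star matters, here is a small sketch comparing the two calls on a deliberately tiny input. It assumes Python 3 (hence the list() wrappers), and the variable names are made up for the example.

```python
slices = [["a", "b", "c"], ["b", "c"], ["c"]]

# With the star: same as zip(["a", "b", "c"], ["b", "c"], ["c"])
with_star = list(zip(*slices))
print(with_star)      # [('a', 'b', 'c')]

# Without the star: zip receives ONE iterable (the outer list),
# so every inner list ends up alone in a 1-tuple.
without_star = list(zip(slices))
print(without_star)   # [(['a', 'b', 'c'],), (['b', 'c'],), (['c'],)]
```

Only the starred call lines the slices up side by side, which is what produces the word sequences.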

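The code you've shown counts "n-grams", which are sequences of n adjacent words in a text. For example, if the text is "A B C D E" and you're looking at 3-grams, you'd want to count ("A", "B", "C"), ("B", "C", "D") and ("C", "D", "E") once each.

The way it does this is a bit tricky, and it may have some bugs.

The key part is the line: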

combination= (zip(*[words[i:] for i in range(n)]))

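Let's work through this from the inside out.

The innermost part is a list comprehension: [words[i:] for i in range(n)]

The comprehension creates n slices of the words list, each skipping one more word than the one before. The first value is the full list of words, the second skips the first word, the third skips two words, and so on (up to skipping n-1 words).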
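
As a quick illustration of those slices, here is a sketch that reuses the "A B C D E" text from earlier with n = 3 (the loop and the name piece are only there for printing):

```python
words = "A B C D E".split()
n = 3

# One slice per offset 0 .. n-1, each dropping one more leading word.
slices = [words[i:] for i in range(n)]

for offset, piece in enumerate(slices):
    print(offset, piece)
# 0 ['A', 'B', 'C', 'D', 'E']
# 1 ['B', 'C', 'D', 'E']
# 2 ['C', 'D', 'E']
```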

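The next part is the call to zip, which uses * in its syntax to unpack the list created above into separate arguments. A function call like func(*some_list) is equivalent to func(some_list[0], some_list[1], ...) (where the ... means "and so on for the rest of the list items"). It's a useful syntax for calling a function when you don't know the number of arguments in advance.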
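
A tiny sketch of that equivalence, using a hypothetical helper called join_three that is not part of the original code:

```python
def join_three(first, second, third):
    # Hypothetical helper, only here to show argument unpacking.
    return "-".join([first, second, third])

some_list = ["your", "car", "is"]

unpacked = join_three(*some_list)
explicit = join_three(some_list[0], some_list[1], some_list[2])

print(unpacked)              # your-car-is
print(unpacked == explicit)  # True
```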

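So, what does the zip call do? zip takes any number of iterable arguments and iterates over them in parallel, taking one item from each and packing them into a tuple before moving on to the next.

In this specific case, the iterables are all slices of the same list of words, so you end up with tuples of words from the list. Since each slice is offset by one word from the previous one, you end up with the words in the order they appear. Those are the n-grams you're looking for!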
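
Putting the pieces together, here is a sketch on the "A B C D E" example, assuming Python 3 (so the result is wrapped in list()). The second half shows what happens to a line with fewer than n words.

```python
words = "A B C D E".split()
n = 3

ngrams = list(zip(*[words[i:] for i in range(n)]))
print(ngrams)
# [('A', 'B', 'C'), ('B', 'C', 'D'), ('C', 'D', 'E')]

# zip stops at the shortest slice, so a line shorter than n
# words yields no n-grams at all.
short_words = "A B".split()
short_ngrams = list(zip(*[short_words[i:] for i in range(n)]))
print(short_ngrams)  # []
```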

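Outside the zip call there's an extra set of parentheses that doesn't actually do anything. You should probably remove them.

Anyway, this algorithm is a somewhat convoluted way of getting the n-grams, though it has a certain elegance. A more direct approach is to work with indexes and slice out the n-grams directly: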

ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]

I don't know whether this will be faster or slower, but to me it's more obvious what it's doing, so you may prefer it anyway.
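
If you want to convince yourself that the two approaches agree, here is a rough check. The function names ngrams_zip and ngrams_slice are made up for this sketch, and it assumes n is at least 1 (the two versions differ for n = 0).

```python
def ngrams_zip(words, n):
    # The zip-based version from the question.
    return list(zip(*[words[i:] for i in range(n)]))

def ngrams_slice(words, n):
    # The index-based version: one tuple per starting position.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "your car is cooler than mine".split()

# Compare both versions for several n, including n larger than the line.
results_match = all(
    ngrams_zip(words[:length], n) == ngrams_slice(words[:length], n)
    for length in range(len(words) + 1)
    for n in range(1, 8)
)
print(results_match)  # True
```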

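Anyway, the last thing you asked about is the call to update, which the list of n-grams is passed to. It's a method of c, a collections.Counter instance which, for some reason, is a global variable. update adds the supplied n-grams to the Counter, which, as its name suggests, counts them. If the text contains repeated n-grams, the count for each one accumulates in c.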
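
Here is a minimal sketch of how Counter.update accumulates counts across calls. The bigrams are made-up sample data, not taken from your file.

```python
from collections import Counter

c = Counter()

# Each call adds to the existing counts instead of replacing them.
c.update([("a", "b"), ("b", "c")])
c.update([("a", "b")])

print(c[("a", "b")])  # 2
print(c[("b", "c")])  # 1
print(c[("x", "y")])  # 0, missing keys count as zero
```

Tuples work as keys because they are hashable, which is one reason the n-grams are built as tuples rather than lists.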

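However, c being global is a bit of a problem. If you want to count a different text later, you're out of luck, because c already holds the counts from the previous text (so you'd get the combined counts of both texts). Actually, now that I look at it, you've got your file object as a global variable too.

You should probably create c inside the function and pass the file object to the function as an argument, so you can reuse it on different data as needed: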

def n_grams(data, n): # pass the file (or some other iterable) as the data argument
    c = Counter() # create the Counter in the function
    for line in data:
        words = line.split() # split the line into words
        ngrams = zip(*[words[i:] for i in range(n)]) # compute the line's n-grams
        # or equivalently: [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
        c.update(ngrams)   # add them to the count
    return c   # return the count at the end
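
A usage sketch of this kind of function, assuming Python 3 and using lists of strings in place of an open file (any iterable of lines should work). The sample sentences are invented for the example.

```python
from collections import Counter

def n_grams(data, n):
    c = Counter()                      # fresh Counter on every call
    for line in data:
        words = line.split()
        c.update(zip(*[words[i:] for i in range(n)]))
    return c

# Two independent texts, counted separately.
first = n_grams(["the cat sat", "the cat ran"], 2)
second = n_grams(["a dog sat"], 2)

print(first.most_common(1))     # [(('the', 'cat'), 2)]
print(second[("the", "cat")])   # 0, counts from the first text do not leak
```

Because each call builds its own Counter, the counts from one text never mix with another, which is the problem the global c had. Note that n-grams are still counted per line, so sequences spanning a line break are not included.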
