Efficient extraction of 1- to 5-grams with Python



I have a big file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as follows, but this doesn't seem to be an efficient way to extract the 1- to 5-grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')
with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
        io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file.
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'
    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())
    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())
    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())

Is there another way to do this more efficiently?

And how can I extract the different orders of n-grams (1 through 5) simultaneously in an optimized way?

From Fast/Optimize N-gram implementations in python, it is essentially doing something like this:

zip(*[words[i:] for i in range(n)])

which, when hard-coded for bigrams, n=2, is:

zip(src_words,src_words[1:])

and this for trigrams, n=3:

zip(src_words,src_words[1:],src_words[2:])

If you are interested only in the most common (frequent) n-grams, which I suppose is your case, you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines a given n-gram is contained in, it efficiently searches for all such n-grams.

The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. Once you have such a function prepared (it may be optimized, as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those that are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all possible 2-grams formed by combining the 1-grams from F1. Again, select those that meet the s_min criterion and you get F2. Combining all the 2-grams from F2 and selecting the frequent 3-grams gives you F3. Repeat for as long as Fn is non-empty.
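
A minimal sketch of this levelwise search, assuming the corpus fits in memory as a list of tokenized lines and using naive per-line counting as the query (the function name frequent_ngrams and the variable names are mine, purely illustrative):

from collections import Counter

def frequent_ngrams(lines, s_min, max_n=5):
    """Apriori-style levelwise search: F[n] holds the n-grams supported by >= s_min lines."""
    F = {}
    # F1: line support of each token (each token counted at most once per line).
    support = Counter(tok for line in lines for tok in set(line))
    F[1] = {(tok,) for tok, c in support.items() if c >= s_min}
    n = 1
    while F[n] and n < max_n:
        # Candidate (n+1)-grams: combine x and y whenever x[1:] == y[:-1].
        candidates = {x + (y[-1],) for x in F[n] for y in F[n] if x[1:] == y[:-1]}
        # Count line support of the surviving candidates only.
        support = Counter()
        for line in lines:
            grams = set(zip(*[line[i:] for i in range(n + 1)]))
            support.update(g for g in grams if g in candidates)
        F[n + 1] = {g for g, c in support.items() if c >= s_min}
        n += 1
    return F

# Example on the toy corpus from the question:
corpus = [line.split() for line in [
    "This is a foo bar sentence .",
    "There is a comma , in this sentence .",
    "Such is an example text .",
]]
print(frequent_ngrams(corpus, s_min=2))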

Many optimizations can be made here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y can only be combined into an (n+1)-gram iff x[1:] == y[:-1] (this can be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (many GB for your corpus), you can speed up the query function dramatically: for each 1-gram, store a hash set of the indices of the lines containing that 1-gram. When combining two n-grams into an (n+1)-gram, intersect the two corresponding sets to obtain the set of lines where the (n+1)-gram could be contained.
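
A sketch of that inverted-index optimization: keep, for each frequent n-gram, the set of line indices it occurs in, so a candidate (n+1)-gram only has to be verified on the intersection of its two parents' line sets (again, all names are illustrative, and the quadratic candidate generation is kept naive):

from collections import defaultdict

def frequent_ngrams_indexed(lines, s_min, max_n=5):
    """Levelwise search that stores, for each frequent n-gram, the lines containing it."""
    # index[gram] = set of line numbers where gram occurs.
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for tok in line:
            index[(tok,)].add(i)
    F = {1: {g: s for g, s in index.items() if len(s) >= s_min}}
    n = 1
    while F[n] and n < max_n:
        F[n + 1] = {}
        for x, x_lines in F[n].items():
            for y, y_lines in F[n].items():
                if x[1:] != y[:-1]:
                    continue
                gram = x + (y[-1],)
                # Only lines containing both parents can contain the candidate.
                lines_to_check = x_lines & y_lines
                hits = {i for i in lines_to_check
                        if any(tuple(lines[i][j:j + n + 1]) == gram
                               for j in range(len(lines[i]) - n))}
                if len(hits) >= s_min:
                    F[n + 1][gram] = hits
        n += 1
    return F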

The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered out as the algorithm runs, saving computation time for the frequent ones only.

Here are a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful to you and help you figure this out.

For what you are doing (I am guessing some sort of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you provided). Processing them separately will be cheaper in terms of the amount of data you need to hold in memory at any given time.

You are reading a lot of data into memory, processing it (which takes even more memory), and then holding the results in in-memory data structures. Instead of doing that, try to be lazier: learn about Python generators and write a generator that streams all the ngrams out of a given text without needing to hold the entire text in memory at any point in time. The itertools package will probably come in handy while writing it.
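
A small sketch of such a generator, assuming line-by-line streaming and the same <s>/</s> markers as in the question (the file name is just the one from the question, reused as a placeholder):

from collections import Counter

def stream_ngrams(path, n):
    """Lazily yield n-grams line by line; only one line is held in memory at a time."""
    with open(path, encoding='utf8') as f:
        for line in f:
            words = ['<s>'] + line.split() + ['</s>']
            # Slide a window of size n over the current line only.
            for i in range(len(words) - n + 1):
                yield tuple(words[i:i + n])

# One pass per order; memory is bounded by the Counter, not by the corpus.
counts = {n: Counter(stream_ngrams('train-1.tok.en', n)) for n in range(1, 6)}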

Beyond a certain point, it will no longer be feasible for you to hold all this data in memory, and you should consider looking at map-reduce to help you break it down. Check out the mrjob package, which lets you write map-reduce jobs in Python: in the mapper step you break the text down into its ngrams, and in the reducer stage you count the number of times you see each ngram to get its total count. mrjob can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
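
A minimal mrjob sketch along these lines (the class name and the choice of emitting all 1- to 5-grams from the mapper are mine, not tested against your data):

from mrjob.job import MRJob

class MRNgramCount(MRJob):
    """Map: emit every 1- to 5-gram of each line. Reduce: sum the occurrences."""

    def mapper(self, _, line):
        words = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):
            for i in range(len(words) - n + 1):
                yield ' '.join(words[i:i + n]), 1

    def reducer(self, ngram, counts):
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()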

If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement a pruning strategy to drop very rare ngrams, or consider using a file-based persistent lookup table such as sqlite to hold the data for you.
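
A sketch of the sqlite-backed variant using the standard-library sqlite3 module; the schema, the batching interval, and counting only bigrams here are my own choices for illustration:

import sqlite3
from collections import Counter

def flush_counts(db, counter):
    """Merge an in-memory Counter into a persistent ngram -> count table."""
    with db:  # one transaction per flush
        db.executemany(
            "INSERT OR IGNORE INTO counts(ngram, c) VALUES (?, 0)",
            ((' '.join(g),) for g in counter))
        db.executemany(
            "UPDATE counts SET c = c + ? WHERE ngram = ?",
            ((c, ' '.join(g)) for g, c in counter.items()))

db = sqlite3.connect('ngrams.db')
db.execute("CREATE TABLE IF NOT EXISTS counts (ngram TEXT PRIMARY KEY, c INTEGER)")

counter = Counter()
with open('train-1.tok.en', encoding='utf8') as f:
    for i, line in enumerate(f, 1):
        words = ['<s>'] + line.split() + ['</s>']
        counter.update(zip(words, words[1:]))   # bigrams, as an example
        if i % 100000 == 0:                     # flush periodically to cap memory use
            flush_counts(db, counter)
            counter.clear()
flush_counts(db, counter)
db.close()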

Assuming you don't want to count ngrams across lines, and assuming naive tokenization:

import collections

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1]) # pre-seed so the counter doesn't count incomplete n-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque) # n-gram tokenization

counters = [collections.Counter(ngrams(n, open('somefile.txt'))) for n in range(1, 6)]

Edit: added beginning/end-of-line tokens.

The resulting data objects are, I believe, as sparse as possible. 3M lines with 40 words each is roughly 120M tokens. With roughly 1M distinct words in English (though the less common ones are rarely used), you will probably get a rather long tail. If you can imagine your data to be exchangeable/i.i.d., then you can add some pruning in the middle:

def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1]) # pre-seed so incomplete n-grams are skipped
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            # After prune_after lines, only n-grams that have already been seen are counted.
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter

Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.

As far as raw speed goes, I think zip (as in the original code) vs. deque is the crucial question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop but consumes the data iteratively, so its working memory footprint should be much smaller. Which is better probably depends on your machine, but I would imagine that for large machines / small data, zip is faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gains a few more advantages.
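
If you want to check this on your own machine, a rough timeit comparison of the two strategies might look like the following (toy data, and the numbers will vary with hardware and corpus shape):

import collections, timeit

words = ("the quick brown fox jumps over the lazy dog . " * 20000).split()

def zip_ngrams(words, n):
    # zip over shifted slices: no explicit inner loop, but holds all slices at once.
    return collections.Counter(zip(*[words[i:] for i in range(n)]))

def deque_ngrams(words, n):
    # sliding deque: explicit inner loop, but constant working memory per token.
    counter = collections.Counter()
    d = collections.deque(maxlen=n)
    d.extend(words[:n - 1])
    for w in words[n - 1:]:
        d.append(w)
        counter[tuple(d)] += 1
    return counter

print('zip  ', timeit.timeit(lambda: zip_ngrams(words, 5), number=10))
print('deque', timeit.timeit(lambda: deque_ngrams(words, 5), number=10))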
