How to find a set of keywords in a document, with all/some of the keywords within a certain distance cutoff



I have a set of keywords, about 10 of them. I want to search a very long document and check whether the set of keywords appears there — not just whether they are present in the text, but whether all of them, or some subset of them, occur within a distance cutoff of each other, e.g. within 3 sentences, or 30 words, or any other proximity measure. How could one do this? I had thought of writing some Python code that finds one of the keywords and then checks whether any of the other keywords occur within 3 lines or so, but that would take a lot of computing power and would be inefficient.

To determine whether a set of keywords all occur within a given distance of each other in a larger document, you can use a sliding window whose length equals the given distance and move it across the document. As the window moves, keep track of each word that falls into and out of it. If at any point the window contains all of the keywords, the condition is satisfied. This approach takes O(len(document)) time and O(len(window)) memory.

Below is an example Python implementation of the approach described above:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen
    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> positions of that keyword currently inside the window
    numKeywordsInWindow = 0
    locKeyword = {}                    # position -> keyword found at that position
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))
        # The word at position i - windowLen slides out of the window on this iteration
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False
Sample output:

>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True

A suggestion for solving this is to build a (Hash)Map with each word as a key, and append each word's position to a list that serves as the value in the map.

For the text "The quick brown fox jumps over the lazy dog", this yields a model like the following (shown in JSON format).

Note: all words here are added to the index as if they were lowercase.

{
    "document": [
        {
            "key": "the",
            "value": [
                {
                    "location": 1
                },
                {
                    "location": 7
                }
            ]
        },
        {
            "key": "quick",
            "value": [
                {
                    "location": 2
                }
            ]
        },
        {
            "key": "brown",
            "value": [
                {
                    "location": 3
                }
            ]
        },
        {
            "key": "fox",
            "value": [
                {
                    "location": 4
                }
            ]
        },
        {
            "key": "jumps",
            "value": [
                {
                    "location": 5
                }
            ]
        },
        {
            "key": "over",
            "value": [
                {
                    "location": 6
                }
            ]
        },
        {
            "key": "lazy",
            "value": [
                {
                    "location": 8
                }
            ]
        },
        {
            "key": "dog",
            "value": [
                {
                    "location": 9
                }
            ]
        }
    ] 
}
Once the index is created, it is easy to see the distance between different words, as shown for the word "the", which occurs at positions 1 and 7.

Also, the number of times a word appears in the text is easily obtained from the number of positions recorded for that word.

Hint: add extra position information, such as which chapter/section/page the word occurs in, etc.
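
As a rough sketch of this index-based idea (not part of the original answer), the map can be built with a plain dict of position lists; the helper names build_index and min_distance below are made up for illustration:

from collections import defaultdict

def build_index(text):
    """Map each lowercased word to the list of positions (1-based) where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        index[word].append(pos)
    return index

def min_distance(index, word_a, word_b):
    """Smallest distance between any occurrence of word_a and any occurrence of word_b,
    or None if either word is missing from the index."""
    if word_a not in index or word_b not in index:
        return None
    return min(abs(a - b) for a in index[word_a] for b in index[word_b])

index = build_index("The quick brown fox jumps over the lazy dog")
print(index["the"])                       # [1, 7]
print(len(index["the"]))                  # occurrence count: 2
print(min_distance(index, "fox", "dog"))  # 5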

I ran some simple benchmarks under these conditions:

  • Python 3.4 in Windows
  • 150 distinct random words, 5 to 16 characters long
  • 10 search words, all of which must be found
  • window length of 75
  • iterating over 50 million words, about 514 million characters in total

Word generation:

import numpy as np

def generator(gen_salt):
    # word(), n_distinct_words and n_words are assumed to be defined elsewhere
    # in the benchmark; word(i) produces the i-th distinct random word.
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)
    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]

Search code (words = generator, search_words = set, window_len = int):

from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0
    # A deque with maxlen automatically drops the oldest word once it is full
    window = deque([], window_len)
    for word in words:
        window.append(word)
        if word in search_words:
            # Linear scan of the window for every search word
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break
            if all_found:
                result.append(pos)
        pos += 1
    return result, time() - start
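
A hypothetical way to wire the generator and the search together, using the parameter values from the benchmark conditions above (this assumes the same undefined word() helper as in the generation code):

n_distinct_words = 150
n_words = 5e7
search_words = set(word(i) for i in range(10))  # 10 of the 150 generated words
positions, elapsed = deque_window(generator(0), search_words, 75)
print("all %d search words found together at %d window positions in %.1f s"
      % (len(search_words), len(positions), elapsed))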

Interestingly, even just counting the total number of characters took 31 seconds, while finding the window positions that contained all the search words took only 48 seconds. I'm not sure whether the random lookup or the list lookup is really that slow. I would need a more efficient generator; perhaps I'll store the result on disk and try reading it from there (which would be closer to your scenario).

Computing the sum of the lengths:

sum(len(w) for w in words)

All you need is the open-source Apache Solr software.

Apache Solr is the popular, blazing-fast, open-source enterprise search platform built on Apache Lucene™.

Click the link for more information. Believe me, it delivers fast results even for terabytes of data.
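
For reference, the Lucene/Solr query syntax supports proximity searches of the form "word1 word2"~N, which matches documents where the quoted terms occur within N positions of each other. Below is a minimal sketch of sending such a query to a Solr core over HTTP with Python's requests library; the core name "documents" and the field name "body" are placeholders, not from the answer above:

import requests

# Proximity query: "fox" and "dog" must occur within 30 word positions of each other.
params = {
    "q": 'body:"fox dog"~30',
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/documents/select", params=params)
print(resp.json()["response"]["numFound"])  # number of matching documents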