I have a set of about 10 keywords. I want to run a search over a very long document and check whether I can find the keyword set there — not just whether each keyword occurs somewhere in the text, but whether all of them (or some subset) occur within a distance cutoff, e.g. 3 sentences, or 30 words, or some other proximity measure. How can one do that? I just thought of writing some Python code that finds one of the keywords and then checks whether the other keywords occur within 3 lines or so. But that would take a lot of computing power and would be inefficient.
To determine whether a set of keywords all occur within a given distance of each other in a larger document, you can use a sliding window whose length equals that distance and move it across the document. As the window moves, keep track of every word that enters and leaves it. If the window ever contains all the keywords, the condition is satisfied. The approach runs in O(len(document)) time with O(len(window)) memory.
Here is a sample Python implementation of the approach described above:
```python
from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen
    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)   # keyword -> positions currently inside the window
    numKeywordsInWindow = 0
    locKeyword = {}                     # position -> keyword found at that position
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False
```
Sample output:

```python
>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True
```
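The question also mentions sentence-based proximity (e.g. all keywords within 3 sentences). The same sliding-window idea works with sentences as the distance unit. A minimal sketch, assuming sentences can be split naively on `.`, `!`, `?` (a real tokenizer such as NLTK's handles abbreviations and other edge cases better); the function name is illustrative:

```python
import re

def keywords_within_sentences(doc, keywords, max_sentences=3):
    # naive sentence split; good enough for a sketch
    sentences = [s for s in re.split(r'[.!?]+', doc) if s.strip()]
    # precompute the word set of each sentence (lowercased, like the index answer)
    sentence_words = [set(s.lower().split()) for s in sentences]
    keywords = {k.lower() for k in keywords}
    for i in range(len(sentences)):
        # union of the words in a window of max_sentences consecutive sentences
        window = set().union(*sentence_words[i:i + max_sentences])
        if keywords <= window:   # all keywords appear in this window
            return True
    return False
```

The per-step cost here is proportional to the window's word count rather than O(1), but for windows of a few sentences that is rarely a problem.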
My suggestion for solving this is to create a (hash) map, entering each word as a key and appending the word's position to a list that serves as the map's value.
For the text "The quick brown fox jumps over the lazy dog", this yields a model like the following (in JSON format).
Note: all words are added to the index as if they were lowercase.
```json
{
    "document": [
        {
            "key": "the",
            "value": [
                { "location": 1 },
                { "location": 7 }
            ]
        },
        {
            "key": "quick",
            "value": [
                { "location": 2 }
            ]
        },
        {
            "key": "brown",
            "value": [
                { "location": 3 }
            ]
        },
        {
            "key": "fox",
            "value": [
                { "location": 4 }
            ]
        },
        {
            "key": "jumps",
            "value": [
                { "location": 5 }
            ]
        },
        {
            "key": "over",
            "value": [
                { "location": 6 }
            ]
        },
        {
            "key": "lazy",
            "value": [
                { "location": 8 }
            ]
        },
        {
            "key": "dog",
            "value": [
                { "location": 9 }
            ]
        }
    ]
}
```
Once the index is created, it is easy to see the distances between different words. E.g. the word "the", which is found at positions 1 and 7.
Also, the number of times a word occurs in the text is easily obtained from the number of positions stored for it.
Hint: store additional position information, such as chapter/section/page, etc.
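A minimal sketch of this index-based approach in Python: build the word-to-positions map, then check whether one occurrence of every keyword fits inside a window of a given number of words. The function names are illustrative, and the proximity check brute-forces over position combinations, which is fine for a handful of keywords:

```python
from collections import defaultdict
from itertools import product

def build_index(doc):
    # map each lowercased word to the list of positions where it occurs
    index = defaultdict(list)
    for pos, word in enumerate(doc.lower().split(), start=1):
        index[word].append(pos)
    return dict(index)

def within_span(index, keywords, max_span):
    # True if some occurrence of every keyword fits in a max_span-word window;
    # a merge/pointer scheme over the sorted position lists would scale better
    try:
        position_lists = [index[k.lower()] for k in keywords]
    except KeyError:
        return False   # some keyword never occurs in the document
    return any(max(combo) - min(combo) < max_span
               for combo in product(*position_lists))
```

For example, `build_index("The quick brown fox jumps over the lazy dog")` produces exactly the mapping shown in the JSON above (`"the"` at positions 1 and 7, and so on).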
I ran some simple benchmarks under these conditions:
- Python 3.4 on Windows
- 150 distinct random words, 5-16 characters long
- 10 search words, all of which must be found
- window length of 75
- iterating over 50 million words, about 514 million characters in total
Word generation:

```python
import numpy as np

def generator(gen_salt):
    # word(), n_distinct_words and n_words are assumed defined elsewhere;
    # word(i) builds the i-th distinct random word of 5-16 characters
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)
    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]
```
Search code (`words` = generator, `search_words` = set, `window_len` = int):
```python
from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0
    window = deque([], window_len)   # bounded deque: old words fall off the front
    for word in words:
        window.append(word)
        if word in search_words:
            # check whether every search word is currently inside the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:   # O(window_len) scan
                    all_found = False
                    break
            if all_found:
                result.append(pos)
        pos += 1
    return result, time() - start
```
In fact, even just counting the total number of characters took 31 seconds, while finding the indices of windows containing all the search words took only 48 seconds. I'm not sure the deque or list lookup is really that slow. I would need a more efficient generator; perhaps I'll store the words on disk and try reading them from there (which would be closer to your scenario).
Sum used to compute the character count:
```python
sum(len(w) for w in words)
```
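Most of the 48 seconds likely goes to the inner `search_word not in window` loop, which scans the whole deque (O(window_len)) every time a search word arrives. A sketch of the same search that instead keeps per-keyword occurrence counts, so each step costs O(1); `counted_window` is an illustrative name, not part of the original benchmark:

```python
from collections import deque, Counter

def counted_window(words, search_words, window_len):
    # same result as deque_window (minus the timing), but tracks how many
    # occurrences of each search word are inside the window instead of
    # rescanning the deque on every hit
    result = []
    window = deque(maxlen=window_len)
    counts = Counter()       # search word -> occurrences inside the window
    n_present = 0            # distinct search words currently in the window
    for pos, word in enumerate(words):
        if len(window) == window_len:
            old = window[0]  # word about to fall off the front
            if old in search_words:
                counts[old] -= 1
                if counts[old] == 0:
                    n_present -= 1
        window.append(word)
        if word in search_words:
            counts[word] += 1
            if counts[word] == 1:
                n_present += 1
        if n_present == len(search_words):
            result.append(pos)
    return result
```

On the sample sentence from the first answer it reports the same hit position (13) at which `isInProximityWindow` returns True.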
All you need is the open-source Apache Solr software.
Apache Solr is a popular, blazing-fast, open-source enterprise search platform built on Apache Lucene™.
Follow this link for more information. Trust me, it delivers fast results even for terabytes of data.
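For reference, the Lucene query syntax that Solr exposes supports proximity searches directly via the `~` slop operator on a phrase, so no custom window code is needed; the field name `text` below is an assumption, and the slop counts term moves, so it is only an approximation of a strict word-distance cutoff:

```
text:"fox over the"~30
```

This matches documents where the three terms occur within roughly 30 positions of each other, in any order.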