Using algorithms like Levenshtein (python-Levenshtein or difflib), it is easy to find approximate matches. For example:
>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571
A fuzzy match can be detected by deciding on a threshold as needed.
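A minimal stdlib-only sketch of that thresholding (the helper name `is_fuzzy_match` and the 0.8 cutoff are my own choices for illustration):

```python
import difflib

def is_fuzzy_match(a, b, threshold=0.8):
    """Return True when the similarity ratio of the two strings meets the threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(is_fuzzy_match("amazing", "amaging"))  # True: ratio ~0.857 >= 0.8
print(is_fuzzy_match("amazing", "banana"))   # False: ratio well below 0.8
```

Tune the threshold up or down depending on how strict the match needs to be.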
Current requirement: find a fuzzy substring in a larger string, based on a threshold.
For example:
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string
A brute-force solution is to generate all substrings of length N-1 through N+1 (or some other matching length), where N is the length of query_string, run Levenshtein on each one, and check it against the threshold.
Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available one?
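For reference, the brute-force approach described above can be sketched with difflib alone (the function name `fuzzy_substrings` and the N-1..N+1 window sizes are illustrative assumptions):

```python
import difflib

def fuzzy_substrings(query, large, threshold):
    """Brute force: score every window of length N-1..N+1 against the query."""
    n = len(query)
    hits = []
    for size in (n - 1, n, n + 1):
        for start in range(len(large) - size + 1):
            window = large[start:start + size]
            score = difflib.SequenceMatcher(None, window, query).ratio()
            if score >= threshold:
                hits.append((window, start, score))
    return hits

large_string = "thelargemanhatanproject is a great project in themanhattincity"
hits = fuzzy_substrings("manhattan", large_string, 0.85)
# Both misspellings are found (at indexes 8 and 49), along with overlapping near-windows.
print(hits)
```

This is O(len(large) * len(query)^2)-ish and rescans each window from scratch, which is exactly why the answers below reach for purpose-built tools.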
--------------------UPDATE AND SOLUTION------------------
The Python regex module works quite well, although it is a little slower than the built-in re module for the fuzzy-substring case, an expected result given the extra operations. The desired output is good, and control over the amount of fuzziness is easy to define.
>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
The new regex library, which is slated to eventually replace re, includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy-matching syntax looks fairly expressive, but this pattern matches with one or fewer insertions/additions/deletions:
import regex
regex.match('(amazing){e<=1}', 'amaging')
I use fuzzywuzzy to fuzzy-match against a threshold, and fuzzysearch to fuzzily extract words from the match.
process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of tuples of matches and scores above the cutoff score.
find_near_matches takes the result of process.extractBests and returns the start and end indexes of words. I use the indexes to build the words, and use the built word to find the index in the large string. max_l_dist of find_near_matches is the "Levenshtein distance", which has to be adjusted to suit the needs.
from fuzzysearch import find_near_matches
from fuzzywuzzy import process
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)
To test:
query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))
Output:
query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49
query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58
query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29
How about using difflib.SequenceMatcher.get_matching_blocks?
>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888
>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666
Update
import difflib
def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))
The above code prints: ['manhatan', 'manhattn']
The approaches above are good, but I needed to find a small needle in a lot of hay, and ended up approaching it like this:
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs
needle = "this is the string we want to find"
hay = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"
needle_length = len(needle.split())
max_sim_val = 0
max_sim_string = u""
for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio()
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string
This yields:
0.72972972973 this string is the one we wanted to find
Recently I wrote an alignment library for Python: https://github.com/eseraygun/python-alignment
Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. Actually, in your case you need semi-local alignments, since you don't care about substrings of query_string. In the following code, I simulated a semi-local algorithm using local alignment and some heuristics, but it is easy to extend the library for a proper implementation.
Here is the example code from the README file, modified for your case.
from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)
# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)
# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)
# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print
The output for minScore=5 is as follows.
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778
m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
If you remove the minScore argument, you get only the best scoring matches.
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the sequences.
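For reference, the same O(n * m) dynamic-programming shape appears in a plain Levenshtein distance; here is a minimal stdlib sketch (my own illustration, not part of the alignment library above):

```python
def levenshtein(a, b):
    """Classic O(len(a) * len(b)) edit-distance dynamic program."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("manhatan", "manhattan"))   # 1: one missing 't'
print(levenshtein("manhattin", "manhattan"))  # 1: 'i' substituted for 'a'
```

Each cell depends only on its left, top, and top-left neighbors, which is why the full n-by-m table (here kept as two rows) must be filled.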
I ran into this problem, and I found that neither of the top two answers worked for me. Instead, I used the following algorithm to detect the minimally-errored fuzzy match:
from typing import Optional

import regex

def fuzzy_substring_search(major: str, minor: str, errs: int = 4) -> Optional[regex.Match]:
    """Find the closest matching fuzzy substring.

    Args:
        major: the string to search in
        minor: the string to search with
        errs: the total number of errors
    Returns:
        Optional[regex.Match] object
    """
    errs_ = 0
    s = regex.search(f"({minor}){{e<={errs_}}}", major)
    while s is None and errs_ <= errs:
        errs_ += 1
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
    return s
This has the benefit of returning the exact match if one exists, and escalating the fuzziness as needed.