What's a more efficient way of looping with regex?

I have a list of names that I use to pull matching entries out of a list of target strings. For example:

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
output = ['Chris Smith', 'Kim', 'CHRIS']

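The rules so far are:

Case-insensitive
Must not match partial words (i.e. 'Christmas'/'hijacked' should not match Chris/Jack)
Other words in the string are fine, as long as the name is found in the string per the criteria above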

To accomplish this, another SO user suggested the following code in this thread:

[targ for targ in target_list if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in first_names)]

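This works very accurately so far, but it is very slow: the names list is about 5,000 entries long, and the target list ranges from 20 to 100 lines, with strings up to 30 characters long.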

Any suggestions on how to improve performance?

SOLUTION: Both of the regex-based solutions ran into OverflowErrors, so unfortunately I couldn't test them. The solution that worked (from @mgilson's answer) was:

new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]

This gave a huge performance improvement, from 15 seconds down to under 1 second.
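For anyone who wants to try the accepted approach in isolation, here is a minimal self-contained sketch using the sample data from the question. The function name filter_by_names is made up for illustration, and the code is written so it runs on Python 3 as well as 2.7.

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

def filter_by_names(names, target):
    """Keep each target string that contains one of the names as a whole word.

    filter_by_names is a hypothetical helper, not part of any library.
    """
    # Lowercase once up front; set membership is a constant-time lookup on average.
    lowered = set(name.lower() for name in names)
    # str.split() with no argument splits on runs of whitespace only.
    return [t for t in target if any(word in lowered for word in t.lower().split())]

output = filter_by_names(names, target)
print(output)
```

One thing to watch out for: because split() only splits on whitespace, a string like 'Kim, hello' yields the token 'kim,' and would not match. If your data has punctuation next to names, you would need to strip it first.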

It seems like you could combine them all into one super regex:

import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
# \b is a word boundary, so 'Chris' will not match inside 'Christmas'
regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print(regex_string)
regex = re.compile(regex_string,re.I)
print([t for t in target if regex.search(t)])

A non-regex solution, which only works if each name is a single word (no whitespace):

new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]

The any(...) expression could also be written as:

any(x in new_names for x in t.lower().split())

or

any(x.lower() in new_names for x in t.split())

Or, another variant that relies on set.intersection (suggested by @DSM below):

[ t for t in target if new_names.intersection(t.lower().split()) ]

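If performance is really critical, you could profile to see which one performs best; otherwise, pick whichever you find easiest to read/understand.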
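If you do want to measure, a rough timeit sketch along these lines should do. The function names are made up for illustration, and the sample data is far too small to say anything about a 5,000-name list, so swap in your real data before drawing conclusions.

```python
import timeit

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']
new_names = set(name.lower() for name in names)

def with_map():
    return [t for t in target if any(map(new_names.__contains__, t.lower().split()))]

def with_genexp():
    return [t for t in target if any(x in new_names for x in t.lower().split())]

def with_intersection():
    # A non-empty set is truthy, so this keeps strings sharing at least one word.
    return [t for t in target if new_names.intersection(t.lower().split())]

for fn in (with_map, with_genexp, with_intersection):
    # number=10000 is an arbitrary repeat count; raise it for steadier numbers.
    seconds = timeit.timeit(fn, number=10000)
    print('%-18s %.4f s' % (fn.__name__, seconds))
```

The absolute timings depend on your machine and Python version, so only the relative ordering on your own data is meaningful.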

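*If you're using Python 2.x and you go the map route above, you'll probably want itertools.imap instead of map so that it evaluates lazily. It also makes me wonder whether Python provides a lazy str.split with performance on par with the non-lazy version...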
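As far as I know there is no built-in lazy str.split, but one can be approximated with re.finditer, which yields matches one at a time. The sketch below is only an illustration: iter_words and has_name are made-up names, and I have not measured whether this beats plain split() (for strings of around 30 characters it may well be slower, since split() runs entirely in C).

```python
import re

# \S+ matches a run of non-whitespace, i.e. the same tokens str.split() returns.
_WORD = re.compile(r'\S+')

def iter_words(text):
    """Yield whitespace-separated tokens one at a time (hypothetical helper)."""
    for match in _WORD.finditer(text):
        yield match.group()

def has_name(text, lowered_names):
    """Return True as soon as one token is found in lowered_names."""
    # any() stops consuming the generator at the first hit,
    # so the rest of the string is never tokenized.
    return any(word in lowered_names for word in iter_words(text.lower()))

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']
new_names = set(name.lower() for name in names)
matches = [t for t in target if has_name(t, new_names)]
print(matches)
```

On Python 3, map is already lazy, so the itertools.imap note only matters on 2.x.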

This is the simplest one I could think of:

[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]

Altogether:

import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
print(results)
>>> 
['Chris Smith', 'Kim']

To make it more efficient, you can compile the regex first.

regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]

Edit

After thinking about the question and looking at some of the comments, I have revised the "solution" to:

import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile( r'\b((%s))\b' % ')|('.join([re.escape(name) for name in names]), re.I )
results = [item for item in target if regex.search(item)]

Result:

>>> 
['Chris Smith', 'Kim', 'CHRIS']

You are currently running one loop inside another, iterating over two lists. That will always give you quadratic performance.

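One local optimization is to compile each name regex (which makes applying each regex faster). However, the big win would be to combine all of the regexes into a single regex and apply that to each item in the input. See @mgilson's answer for how to do this. After that, your code's performance should scale linearly, as O(M+N) rather than O(M*N).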
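To make the difference concrete, here is a small sketch that counts how many search calls each approach makes on the sample data. The function names are made up for illustration. Note that it only counts Python-level calls; the regex engine still has to try the alternatives internally, so how close the combined pattern gets to linear on a 5,000-name list is something to measure rather than assume.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

def nested_loop(names, target):
    """One re.search call per (target, name) pair, up to M*N calls in total."""
    calls = 0
    hits = []
    for t in target:
        for name in names:
            calls += 1
            if re.search(r'\b{}\b'.format(re.escape(name)), t, re.I):
                hits.append(t)
                break  # stop at the first matching name, like any() does
    return hits, calls

def combined(names, target):
    """Build one alternation pattern once, then make one search call per target."""
    pattern = re.compile(
        r'\b(?:%s)\b' % '|'.join(re.escape(n) for n in names), re.I)
    hits = [t for t in target if pattern.search(t)]
    return hits, len(target)

nested_hits, nested_calls = nested_loop(names, target)
combined_hits, combined_calls = combined(names, target)
print(nested_hits, nested_calls)
print(combined_hits, combined_calls)
```

Both functions return the same matches; the nested version just needs more calls to get there, and the gap grows with the length of the names list.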
