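I'm looking for a fast solution in which the code loops over a very long list of sentences (one per line) and replaces substrings from one tuple (or list) with the matching items from another tuple. The (pseudo)code should look like this: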
# an example of one line sentence:
a = "I was thinking to begin this journey."
# tuples: targets and replacements
verbs = ("to begin", "I begin", "you begin", "we begin")
verbs_fixed = ("toXXbegin", "IXXbegin", "youXXbegin", "weXXbegin")
with open(<INPUT FILE NAME>) as inf:
    for line in inf:
        line = ????
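Since the list of sentences is very long, I'm hoping to find the fastest solution.
I was thinking of re.compile followed by some list comprehension. Is there a better way?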
If you zip the two lists, it's just a simple replacement:
for original_value, target_value in zip(verbs, verbs_fixed):
    line = line.replace(original_value, target_value)
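As a self-contained sketch of the loop above, here it is applied to an in-memory list of lines instead of a file. The `replace_all` helper and the sample lines are assumptions made for illustration, not part of the original answer.

```python
verbs = ("to begin", "I begin", "you begin", "we begin")
verbs_fixed = ("toXXbegin", "IXXbegin", "youXXbegin", "weXXbegin")

def replace_all(line, targets=verbs, replacements=verbs_fixed):
    """Apply each (target, replacement) pair in turn with str.replace."""
    for original_value, target_value in zip(targets, replacements):
        line = line.replace(original_value, target_value)
    return line

# Hypothetical sample input; a file object could be iterated the same way.
lines = ["I was thinking to begin this journey.", "Tomorrow we begin."]
fixed = [replace_all(line) for line in lines]
print(fixed)
```

Note that every pair triggers a separate pass over the line, so the cost per line grows with the number of pairs, and a later pair can match text produced by an earlier replacement.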
Using a regular expression
import re

def regex_mapping(sentence):
    """Do the replacements based on the mapping of verbs to verbs_fixed."""
    return regex_pattern.sub(lambda m: mapping[m.group(0)], sentence)

# Setup code
verbs = ("to begin", "I begin", "you begin", "we begin")
verbs_fixed = ("toXXbegin", "IXXbegin", "youXXbegin", "weXXbegin")
# Dictionary mapping
mapping = dict(zip(verbs, verbs_fixed))
# Regex pattern (pre-compiled for speed; re.escape keeps each target literal)
regex_pattern = re.compile('|'.join(re.escape(v) for v in verbs))
Usage
a = "I was thinking to begin this journey."
print(regex_mapping(a))
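To connect this to the line-by-line loop from the question, a minimal sketch might look like the following. It uses `io.StringIO` as a stand-in for the input file, escapes each target with `re.escape`, and tries longer targets first, which matters once targets contain regex metacharacters or share a prefix. The `fix_lines` name and the sample text are assumptions for illustration.

```python
import io
import re

verbs = ("to begin", "I begin", "you begin", "we begin")
verbs_fixed = ("toXXbegin", "IXXbegin", "youXXbegin", "weXXbegin")
mapping = dict(zip(verbs, verbs_fixed))

# Escape each target and put longer targets first in the alternation.
pattern = re.compile(
    "|".join(re.escape(v) for v in sorted(mapping, key=len, reverse=True))
)

def fix_lines(lines):
    """Yield each line with every target replaced in a single regex pass."""
    for line in lines:
        yield pattern.sub(lambda m: mapping[m.group(0)], line)

# Stand-in for the opened input file; any iterable of lines works.
inf = io.StringIO(
    "I was thinking to begin this journey.\n"
    "Then you begin, and we begin.\n"
)
result = list(fix_lines(inf))
print("".join(result), end="")
```

Because `fix_lines` is a generator, the file is processed one line at a time and never has to be held in memory in full.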
Addendum
If your keyword list has hundreds of entries, you should look into this solution based on building a Trie dictionary.
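The referenced solution is not reproduced here. As a rough sketch of the general idea only, a trie lets you find the longest keyword starting at each position without testing every keyword separately. The `build_trie` and `trie_replace` helpers below are hypothetical names, and the nested-dict layout is just one possible representation.

```python
def build_trie(mapping):
    """Build a nested-dict trie; the key None at a node stores the replacement."""
    root = {}
    for word, replacement in mapping.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = replacement
    return root

def trie_replace(text, trie):
    """Replace the longest keyword found at each position, scanning left to right."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        best_end, best_repl = -1, None
        # Walk down the trie as long as the text keeps matching.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:  # a complete keyword ends here
                best_end, best_repl = j, node[None]
        if best_repl is not None:
            out.append(best_repl)
            i = best_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

verbs = ("to begin", "I begin", "you begin", "we begin")
verbs_fixed = ("toXXbegin", "IXXbegin", "youXXbegin", "weXXbegin")
trie = build_trie(dict(zip(verbs, verbs_fixed)))
print(trie_replace("I was thinking to begin this journey.", trie))
```

A character-by-character loop in pure Python is often slower than a compiled regex, so treat this as an illustration of the data structure and measure on your own data before relying on it.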