I have a dictionary whose key: value pairs map compound terms to the expressions I want to replace them with in a text. For example:
terms_dict = {'digi conso': 'digi conso', 'digi': 'digi conso', 'digiconso': 'digi conso', '3xcb': '3xcb', '3x cb': '3xcb', 'legal entity identifier': 'legal entity identifier'}
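My goal is to write a function replace_terms(text, dict) that takes a text and a dictionary as arguments and returns the text with the compound terms replaced.
For example: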
test_text = "i want a digi conso loan for digiconso"
print(replace_terms(test_text, terms_dict))
should return:
"i want a digi conso loan for digi conso"
I tried using .replace(), but for some reason it does not work properly, probably because the terms to replace consist of several words.
I also tried this:
import re

def replace_terms(text, terms_dict):
    if len(terms_dict) > 0:
        words_in = [k for k in terms_dict.keys() if k in text]  # ex: words_in = [digi conso, digi, digiconso]
        if len(words_in) > 0:
            for w in words_in:
                pattern = r"\b" + w + r"\b"  # match w only as a whole word
                text = re.sub(pattern, terms_dict[w], text)
    return text
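But when applied to my text, this function returns "i want a digi conso conso loan for digi conso": the word conso gets doubled, and I can see why (the words_in list is built up front by iterating over the dictionary keys, so it does not reflect the changes made to the text by earlier replacements, and the key digi then matches inside the already correct digi conso).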
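To make the doubling visible, here is a small trace of the same substitutions applied one after another. The helper name trace_substitutions is made up for this illustration, and re.escape is added only so the terms are treated literally; otherwise it mirrors the loop above.

```python
import re
from typing import List, Tuple

def trace_substitutions(text: str, steps: List[Tuple[str, str]]) -> List[str]:
    """Apply each (term, replacement) in order and record the text after every step."""
    history = []
    for term, replacement in steps:
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, replacement, text)
        history.append(text)
    return history

# The three keys of terms_dict that occur in the test text, in dictionary order
steps = [
    ("digi conso", "digi conso"),
    ("digi", "digi conso"),
    ("digiconso", "digi conso"),
]
for (term, _), state in zip(steps, trace_substitutions("i want a digi conso loan for digiconso", steps)):
    print(f"after {term!r}: {state}")
# after 'digi conso': i want a digi conso loan for digiconso
# after 'digi': i want a digi conso conso loan for digiconso
# after 'digiconso': i want a digi conso conso loan for digi conso
```

The second step is where it goes wrong: digi matches as a whole word inside digi conso and is expanded a second time.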
Is there an efficient way to do this?
Thanks a lot!
A quick and dirty approach:
from typing import Dict, List, Tuple

def replace_terms(text: str, terms: Dict[str, str]) -> str:
    # (position of first occurrence, term) for every term we decide to replace
    replacement_list: List[Tuple[int, str]] = []
    check = True
    for term in terms:
        if term in text:
            # Iterate over a copy, since entries may be removed inside the loop
            for replacement in list(replacement_list):
                if replacement[0] == text.index(term):
                    # Two terms start at the same position: keep the longer one
                    if len(term) > len(replacement[1]):
                        replacement_list.remove(replacement)
                    else:
                        check = False
            if check:
                replacement_list.append((text.index(term), term))
            else:
                check = True
    # Replace only the first occurrence of each selected term
    for replacement in replacement_list:
        text = text.replace(replacement[1], terms[replacement[1]], 1)
    return text
Usage:
terms_dict = {
    "digi conso": "digi conso",
    "digi": "digi conso",
    "digiconso": "digi conso",
    "3xcb": "3xcb",
    "3x cb": "3xcb",
    "legal entity identifier": "legal entity identifier"
}
test_text = "i want a digi conso loan for digiconso"
print(replace_terms(test_text, terms_dict))
Result: i want a digi conso loan for digi conso
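A different technique, not the one used in the code above, is to do all replacements in a single pass with one regular expression that alternates over the keys, longest first. Because the text is scanned only once, replaced text is never matched again. This is a minimal sketch; the name replace_terms_single_pass is made up here, and it assumes every key starts and ends with a letter, digit or underscore so that \b behaves as a word boundary.

```python
import re
from typing import Dict

def replace_terms_single_pass(text: str, terms: Dict[str, str]) -> str:
    """Replace every whole-word occurrence of a key in one scan of the text."""
    if not terms:
        return text
    # Longest keys first, so "digi conso" is tried before "digi"
    keys = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # The matched text is always one of the keys, so it can be looked up directly
    return pattern.sub(lambda match: terms[match.group(0)], text)

terms_dict = {
    "digi conso": "digi conso",
    "digi": "digi conso",
    "digiconso": "digi conso",
    "3xcb": "3xcb",
    "3x cb": "3xcb",
    "legal entity identifier": "legal entity identifier",
}
print(replace_terms_single_pass("i want a digi conso loan for digiconso", terms_dict))
# i want a digi conso loan for digi conso
```

Unlike the version with count 1, this replaces every occurrence of each term, and it does not depend on the order of the dictionary keys.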
This should do it:
terms_dict = { 'digiconso': 'digi conso', '3xcb': '3xcb', '3x cb': '3xcb', 'legal entity identifier': 'legal entity identifier'}
test_text = "i want a digi conso loan for digiconso"
def replace_terms(txt, dct):
    dct = tuple(dct.items())
    for x, y in dct:
        txt = txt.replace(x, y, 1)
    return txt
print(replace_terms(test_text, terms_dict))
First I take the dictionary's key/value pairs and put them in a simpler form (a tuple). Then I just replace!
Output:
i want a digi conso loan for digi conso
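Your dictionary had a lot of extra replacement entries that you did not need, so I removed 'digi conso' and 'digi'. I also made it replace only the first occurrence of each term, but you can change that.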
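For reference, the third argument of str.replace is the maximum number of occurrences to replace; leaving it out replaces all of them. A short illustration (the sample sentence is made up):

```python
sample = "digiconso and digiconso again"

first_only = sample.replace("digiconso", "digi conso", 1)  # at most one replacement
all_occurrences = sample.replace("digiconso", "digi conso")  # no limit

print(first_only)       # digi conso and digiconso again
print(all_occurrences)  # digi conso and digi conso again
```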