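I have parsed a URL with the Beautiful Soup package to get its text. I want to remove all of the text found in the terms and conditions section, i.e. every word in "Key Terms: .......... T&Cs apply".
Here is what I tried: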
import re
#"text" is part of the text contained in the url
text="""Welcome to Company Key.
Key Terms; Single bets only. Any returns from the free bet will be paid
back into your account minus the free bet stake. Free bets can only be
placed at maximum odds of 5.00 (4/1). Bonus will expire midnight, Tuesday
26th February 2019. Bonus T&Cs and General T&Cs apply.
"""
# to remove words between "Key" and "T&Cs"
rex=re.compile('Key (.*?)T&Cs.')
terms_and_cons=rex.findall(text)
text=re.sub("|".join(terms_and_cons)," ",text)
#I also tried: text=re.sub(terms_and_cons[0]," ",text)
print(text)
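The string "text" above is unchanged, even though the list "terms_and_cons" is non-empty. How can I successfully remove the words between "Key" and "T&Cs"? Please help. I have been stuck on this for quite a while and it is really frustrating. Thanks.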
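Your pattern is missing the re.DOTALL flag. By default the dot matches any character except a newline; with re.DOTALL it matches newlines as well, so the pattern can span several lines.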
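To see the effect of the flag in isolation, here is a minimal sketch. The two-line `sample` string is made up for illustration and is not the text from the question.

```python
import re

# Hypothetical two-line sample, only for illustration.
sample = "Key Terms; line one\nline two T&Cs apply."

# Without re.DOTALL the dot stops at the newline, so the pattern
# never reaches "T&Cs" on the second line and nothing matches.
no_flag = re.search(r"Key\s(.*)T&Cs", sample)

# With re.DOTALL the dot also matches "\n", so the match spans both lines.
with_flag = re.search(r"Key\s(.*)T&Cs", sample, re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # 'Terms; line one\nline two '
```

If the text between the two markers ever sits on a single line, both calls match, which is why the problem only shows up with multi-line input.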
Method 1: Using re.sub
import re
text="""Welcome to Company Key.
Key Terms; Single bets only. Any returns from the free bet will be paid
back into your account minus the free bet stake. Free bets can only be
placed at maximum odds of 5.00 (4/1). Bonus will expire midnight, Tuesday
26th February 2019. Bonus T&Cs and General T&Cs apply.
"""
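# \s requires whitespace after "Key", so the "Key." on the first line is skipped.
# re.DOTALL lets "." match newlines; the greedy ".*" runs to the last "T&Cs".
rex = re.compile(r"Key\s(.*)T&Cs", re.DOTALL)
text = rex.sub("Key T&Cs", text)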
print(text)
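One detail worth checking is greedy versus non-greedy matching, because "T&Cs" appears twice in the text. The sketch below compares the greedy `.*` with the non-greedy `.*?` from the question, using a shortened `sample` string that I made up to keep the example small.

```python
import re

# Shortened stand-in for the text in the question (an assumption for brevity).
sample = "Key Terms; Single bets only.\nBonus T&Cs and General T&Cs apply.\n"

# Greedy: ".*" runs to the LAST "T&Cs", so the whole section is removed.
greedy = re.sub(r"Key\s(.*)T&Cs", "Key T&Cs", sample, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST "T&Cs", leaving the tail behind.
lazy = re.sub(r"Key\s(.*?)T&Cs", "Key T&Cs", sample, flags=re.DOTALL)

print(repr(greedy))  # 'Key T&Cs apply.\n'
print(repr(lazy))    # 'Key T&Cs and General T&Cs apply.\n'
```

So for this particular text the greedy form is probably what you want. If a page could contain several separate "Key ... T&Cs" sections, the greedy form would swallow everything between the first and the last one, so choose based on your data.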
Method 2: Using groups
Match the text with a capturing group, then remove that group's text from the original text.
import re
text="""Welcome to Company Key.
Key Terms; Single bets only. Any returns from the free bet will be paid
back into your account minus the free bet stake. Free bets can only be
placed at maximum odds of 5.00 (4/1). Bonus will expire midnight, Tuesday
26th February 2019. Bonus T&Cs and General T&Cs apply.
"""
rex = re.compile(r"Key\s(.*)T&Cs", re.DOTALL)
matches = rex.search(text)
# group(1) is everything between "Key " and the last "T&Cs"
if matches:
    text = text.replace(matches.group(1), "")
print(text)
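If you need this in more than one place, the group approach could be wrapped in a small helper. This is only a sketch: `remove_between` and its `start`/`end` parameters are names I made up, and it slices by match position instead of calling `str.replace`, so only the matched span is removed even if the same text occurs elsewhere.

```python
import re

def remove_between(text, start="Key", end="T&Cs"):
    """Drop the text between `start` (followed by whitespace) and the last `end`."""
    # re.escape keeps characters such as "&" literal in the pattern.
    pattern = re.compile(re.escape(start) + r"\s(.*)" + re.escape(end), re.DOTALL)
    match = pattern.search(text)
    if match is None:
        return text  # markers not found: leave the text untouched
    # Cut out only the captured span, keeping both markers in place.
    return text[:match.start(1)] + text[match.end(1):]

# Shortened stand-in for the text in the question (an assumption for brevity).
sample = (
    "Welcome to Company Key.\n"
    "Key Terms; Single bets only.\n"
    "Bonus T&Cs and General T&Cs apply.\n"
)
cleaned = remove_between(sample)
unchanged = remove_between("No markers here.")

print(cleaned)
```

The early return means a page without a terms section passes through unchanged instead of raising an AttributeError on `None`.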