Python Regex - Extract text between (multiple) expressions in a text file



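I am a Python beginner and would be very grateful for help with a text extraction problem.

I would like to extract all text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). If there is no match for "letter_end", i.e. no letter_end expression is found, the output should start at the letter_begin match and run to the very end of the text file being analyzed.

Edit: the end of the "recorded text" has to be after the "letter_end" match and before the first line with 20 characters or more (as is the case for "Random text here as well" -> len=24).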

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas
Random text here as well"""

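This is my code so far, but it cannot flexibly capture the text between the expressions (there can be anything, such as lines, text, numbers or symbols, before "letter_begin" and after "letter_end"):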

import re

letter_begin = ["dear", "to our", "estimated"]  # All expressions for the "beginning" of a letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"]  # All expressions for the "ending" of a letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"

with open(filename, 'r', encoding="utf-8") as infile:  # filename: path to the text file to analyze
    text = infile.read()
text = str(text)
output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)  # record all text between the beginning and end expressions
print(output)

I greatly appreciate any help!

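You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This produces a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

Note that re.DOTALL must not be used with this pattern: it would let .* run across line breaks to the end of the text. The re.MULTILINE option is redundant, since the pattern contains no ^ or $ anchors.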
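To illustrate the point about re.DOTALL, here is a small sketch that runs the same pattern with and without the flag. The sample string and the variable names are made up for this illustration and are not part of the original post.

```python
import re

# Made-up sample: a short letter followed by unrelated lines.
text = "Dear all\nbody line\nBest regards \nDouglas\n\nline five\nline six\nline seven"
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without DOTALL, ".*" stops at each line break, so at most two
# extra lines are taken after the closing line.
without_dotall = re.findall(pattern, text, re.IGNORECASE)

# With DOTALL, the first ".*" also matches line breaks and runs
# to the end of the text.
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall)
```

In the first call the match ends shortly after the signature; in the second it swallows everything up to the end of the string.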

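Details

  • (?:dear|to our|estimated) - any one of the three alternatives
  • [\s\S]*? - any zero or more characters (including line breaks), as few as possible
  • (?:sincerely|yours|best regards) - any one of the three alternatives
  • .* - any zero or more characters other than line break characters
  • (?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any zero or more characters other than line break characters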
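The pieces listed above can also be assembled by a small helper. The sketch below is one possible way to do it, assuming the expressions should be treated as literal text: the function name build_letter_regex is hypothetical, and the use of re.escape is an addition that only matters if an expression contains regex metacharacters such as ".".

```python
import re

def build_letter_regex(openings, closings):
    """Hypothetical helper: compile the pattern from two lists of literal expressions."""
    # re.escape keeps characters such as "." from acting as regex operators.
    begin = "|".join(re.escape(s) for s in openings)
    end = "|".join(re.escape(s) for s in closings)
    pattern = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(begin, end)
    return re.compile(pattern, re.IGNORECASE)

rx = build_letter_regex(["dear", "to our"], ["best regards", "yours."])
match = rx.search("To our clients\nThanks.\nBest regards\nAnn")
print(match.group(0) if match else None)
```

Compiling once with re.IGNORECASE keeps the flag next to the pattern, so callers cannot accidentally pass re.DOTALL.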

Python demo code:

import re

text = """Some random text here
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"]  # All expressions for the "beginning" of a letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"]  # All expressions for the "ending" of a letter
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
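The pattern above takes a fixed number of lines after the closing and finds nothing when no closing is present. The two extra requirements from the question (fall back to the end of the text when there is no closing, and stop before the first line of 20 or more characters) can be handled with a little procedural code around two simpler searches. What follows is only a sketch under those assumptions: the function name extract_letter and the min_len parameter are invented for illustration, and it returns only the first letter found.

```python
import re

def extract_letter(text, openings, closings, min_len=20):
    """Hypothetical helper: return the first letter in text, or None if no opening matches."""
    begin_rx = re.compile("|".join(map(re.escape, openings)), re.IGNORECASE)
    end_rx = re.compile("|".join(map(re.escape, closings)), re.IGNORECASE)

    start = begin_rx.search(text)
    if start is None:
        return None

    # Look for a closing only after the opening.
    end = end_rx.search(text, start.end())
    if end is None:
        # No closing found: keep everything up to the end of the text.
        return text[start.start():]

    # Finish the line that contains the closing expression.
    line_end = text.find("\n", end.end())
    if line_end == -1:
        return text[start.start():]

    # Keep the following short lines (e.g. a signature) and stop
    # before the first line with min_len or more characters.
    stop = line_end
    pos = line_end + 1
    while pos <= len(text):
        nxt = text.find("\n", pos)
        if nxt == -1:
            nxt = len(text)
        if nxt - pos >= min_len:
            break
        stop = nxt
        pos = nxt + 1
    return text[start.start():stop]

sample = ("Intro\nDear Shareholders We\nare pleased to report.\n"
          "Best regards \nDouglas\nRandom text here as well")
print(extract_letter(sample, ["dear", "to our", "estimated"],
                     ["sincerely", "yours", "best regards"]))
```

Because the expressions are matched as plain substrings, a closing such as "yours" would also match inside a longer word like "yourself"; adding word boundaries (\b) around the alternations is one way to tighten this if needed.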
