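I have a letter from which I need to extract a particular section. The beginning and end are clearly marked by opening/closing expressions (letter_beg / letter_end). My problem is that the "recording" of the text needs to stop just before the first line of more than 20 characters that follows the letter_end match. In my code, it instead runs on for two newlines after the match. Here is my sample text and the code I have so far: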
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line """
letter_begin = ["dear", "to our", "fellow investors"]  # All expressions for "beginning" of Letter to the Shareholders (LttS)
openings = "|".join(letter_begin)
letter_end = ["sincerely", "best regards", "cordially,"]  # All expressions for "ending" of Letter to the Shareholders (LttS)
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}"
output = re.findall(regex, sample_text, re.IGNORECASE)  # record all text between Regex (beginning and end expressions)
print(output)
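I'm not entirely sure what your expected output is, but this is fairly simple to do without a regular expression (which eliminates one problem).
The solution below assumes that sample_text contains \n (line breaks); if sample_text is one long line (i.e. it contains no \n), the solution will not work.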
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_idx = None
target_end_idx = None
for index, line in enumerate(lines):
    line = line.lower()
    # Remember the line that opens the letter
    if any(line.startswith(beg) for beg in letter_begin):
        target_start_idx = index
        continue
    # Stop scanning at the first closing phrase
    if any(line.startswith(end) for end in letter_end):
        target_end_idx = index
        break
if target_end_idx is not None:
    # Extend the end over the short lines (signature) that follow the closing,
    # stopping just before the first line of 20 or more characters
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += index
            break
if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)
The output is:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Edit
Based on your last comment, I can think of two approaches. Hopefully one of them solves your problem.
Option 1
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_indexes = []
target_end_indexes = []
for index, line in enumerate(lines):
    line = line.lower()
    # Collect every line that contains an opening phrase
    if any(beg in line for beg in letter_begin):
        target_start_indexes.append(index)
        continue
    # Collect every line that contains a closing phrase
    if any(end in line for end in letter_end):
        target_end_indexes.append(index)
        continue
# Extend each end over the short lines that follow its closing phrase
for target_index, target_end_idx in enumerate(target_end_indexes):
    for line_index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += line_index
            target_end_indexes[target_index] = target_end_idx
            break
target = []
if target_start_indexes and target_end_indexes:
    # Pair the n-th opening with the n-th closing
    for target_start_idx, target_end_idx in zip(
        target_start_indexes, target_end_indexes
    ):
        target.append("\n".join(lines[target_start_idx : target_end_idx + 1]))
print("\n".join(target))
Output:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
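A note on Option 1: zip pairs the n-th opening with the n-th closing, so an opening phrase that happens to appear inside a letter body would shift the pairs. A single pass that tracks state might avoid this. The following is only a sketch under that assumption; the names iter_letters and MIN_LINE_LEN are hypothetical and not part of the code above, and unlike the code above it keeps the trailing short lines when the text ends before a long line is seen.

```python
from typing import Iterator, List

LETTER_BEGIN = ["dear", "to our", "fellow investors"]
LETTER_END = ["sincerely", "best regards", "cordially,"]
MIN_LINE_LEN = 20  # hypothetical name; same threshold as the code above


def iter_letters(text: str) -> Iterator[str]:
    """Yield each letter, from its opening line through the short lines after its closing."""
    current: List[str] = []  # lines of the letter being collected
    in_letter = False        # True once an opening phrase has been seen
    closed = False           # True once a closing phrase has been seen
    for line in text.splitlines():
        lowered = line.lower()
        if closed:
            if len(line) < MIN_LINE_LEN:
                current.append(line)  # short line after the closing (e.g. a signature)
                continue
            # First long line after the closing: the letter is complete
            yield "\n".join(current)
            current, in_letter, closed = [], False, False
            # Fall through so this long line can itself open a new letter
        if not in_letter:
            if any(beg in lowered for beg in LETTER_BEGIN):
                in_letter = True
                current = [line]
            continue
        # Inside a letter: further opening phrases are ignored
        current.append(line)
        if any(end in lowered for end in LETTER_END):
            closed = True
    if closed:
        # Text ended before a long line appeared: emit what was collected
        yield "\n".join(current)


SAMPLE = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letters = list(iter_letters(SAMPLE))
print("\n".join(letters))
```

On the sample text this yields the same two letters as the zip-based version.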
Option 2
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_idx = None
target_end_idx = None
for index, line in enumerate(lines):
    line = line.lower()
    # Keep only the first opening phrase
    if any(beg in line for beg in letter_begin):
        if target_start_idx is None:
            target_start_idx = index
        continue
    # Keep overwriting, so the last closing phrase wins
    if any(end in line for end in letter_end):
        target_end_idx = index
if target_end_idx is not None:
    # Extend the end over the short lines that follow the last closing phrase
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += index
            break
if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)
Output:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
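If you insist on a single monolithic regex, add a positive lookahead at the end for a line with more than 20 characters:
(?=[^\n]{21,})
You may also need to add the re.DOTALL flag:
re.IGNORECASE | re.DOTALL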
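For illustration, here is one way the lookahead might be combined with the pattern from the question. This is a sketch, not a definitive pattern: the leading \n inside the lookahead, the |\Z alternative (so a letter at the very end of the text still matches), the use of re.escape, and the name LETTER_RE are my own assumptions. Note that {21,} means "more than 20 characters", which differs by one from a len(line) >= 20 check.

```python
import re

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

# Escape the phrases in case one of them ever contains a regex metacharacter
openings = "|".join(map(re.escape, letter_begin))
closings = "|".join(map(re.escape, letter_end))

pattern = (
    r"(?:" + openings + r")"   # opening phrase
    + r".*?"                   # letter body, as short as possible
    + r"(?:" + closings + r")" # closing phrase
    + r".*?"                   # rest of the closing line plus any short lines
    + r"(?=\n[^\n]{21,}|\Z)"   # stop before a line of 21+ characters, or at end of text
)
# DOTALL lets "." cross line breaks; IGNORECASE matches "Dear" as well as "dear"
LETTER_RE = re.compile(pattern, re.IGNORECASE | re.DOTALL)

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

output = LETTER_RE.findall(sample_text)
print(output)
```

Because both quantifiers are lazy, each match stops at the first closing phrase and then at the first long line after it, so several letters in one text are returned as separate matches.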