Python: Use a regular expression to find the full text only after a specific word in a string



I have the following text:

text = list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment 
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated 
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka 
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order 
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no 
vill bitta ta naliya abadasa despatched through destination march 18 terms of

Goal: I want to extract the text that follows the word "invoice", in particular the text after its second occurrence.

My approach:

txt = re.findall('invoice (.*)',text)

With the approach above, I expected a list of strings like this:

txt = ['in favour of company z 02 cjpc abstract sheet weighment 
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
written manually on the checklist will not be considered','parth enterprise â invoice no dated 
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment 
taluka ..... #rest of the string]

But I get back the whole string given in text, i.e. the original string. If I use text.partition('invoice'), I do not get the strings shown in txt either.

Any help would be appreciated.

If you want to get 2 matches as in the question, you could use 2 capture groups.

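First match up to the first occurrence of invoice. Then capture everything up to the second occurrence of invoice in group 1.

Then match invoice again, and capture the rest of the string in group 2.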

^.*? invoice (.*?) invoice (.*)

Regex demo | Python demo

For example:

import re
text = "list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of"
regex = r"^.*? invoice (.*?) invoice (.*)"
matches = re.search(regex, text)
if matches:
    print(matches.group(1))
    print('\n')
    print(matches.group(2))

Output:

in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered

parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of

This can be done easily with the split() method. For example:

myText = "jhon is going abroad jhon is thinking about future jhon is absent"

# 1) text after the first occurrence of jhon
print(myText.split('jhon', 1)[1])
# output -> is going abroad jhon is thinking about future jhon is absent

# 2) text after the second occurrence of jhon
print(myText.split('jhon', 2)[2])
# output -> is thinking about future jhon is absent

# 3) text after the third occurrence of jhon
print(myText.split('jhon', 3)[3])
# output -> is absent
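
The same idea can be wrapped in a small helper. This is only a sketch: the function name text_after_nth and its error handling are assumptions made for illustration, not part of the answer above.

```python
def text_after_nth(text: str, word: str, n: int) -> str:
    """Return the text following the n-th occurrence of word (n is 1-based)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Splitting at most n times leaves everything after the n-th
    # occurrence untouched in the last element of the list.
    parts = text.split(word, n)
    if len(parts) <= n:
        raise ValueError(f"{word!r} occurs fewer than {n} times")
    return parts[n]


sample = "jhon is going abroad jhon is thinking about future jhon is absent"
second = text_after_nth(sample, "jhon", 2)
print(second)
```

Note that the returned text keeps the space that followed the word; call .strip() on the result if that is not wanted.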

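Your regex invoice (.*) matches the first literal invoice followed by a space, and then (.*) greedily captures the rest of the text in group 1. That is what is happening here, and it is the expected, correct behavior.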
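
The greedy behavior is easier to see on a much shorter string. The sketch below is illustrative only: the sample string is made up, and the second, lazy pattern is a simplified assumption for comparison, not the pattern proposed in this answer.

```python
import re

sample = "pay invoice one then invoice two"

# Greedy: (.*) runs to the end of the line and swallows the second "invoice".
greedy = re.findall(r"invoice (.*)", sample)

# Lazy with a lookahead: stop before the next " invoice" or at the end.
lazy = re.findall(r"invoice (.*?)(?= invoice|$)", sample)

print(greedy)
print(lazy)
```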

However, if you want the output you described, you have to write the regex accordingly. You could use the following regex to get the desired result:

invoice (.*?)(?=(?:(?:invoice.*){2,}|$))

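Regex explanation:

  • invoice - matches the literal invoice followed by a space
  • (.*?) - captures text lazily
  • (?=(?:(?:invoice.*){2,}|$)) - stops the match as soon as at least 2 literal invoice occurrences lie ahead, or at the end of the input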

Python demo:

import re
s = '''list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of'''
print(re.findall(r'invoice (.*?)(?=(?:(?:invoice.*){2,}|$))', s))

The output is as you wanted:

['in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered ', 'parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of']

Update

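The regex I use relies on a positive lookbehind and a positive lookahead:

(?<=\binvoice )(?:.*?)(?= invoice\b)
  1. (?<=\binvoice ) matches the following subexpression only if it is preceded by invoice and a space, where invoice starts at a word boundary.
  2. (?:.*?)(?= invoice\b) matches any character zero or more times (non-greedily) up to the next space followed by invoice, where invoice ends at a word boundary.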

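Because I copied the input and my copy contains newlines that were not in the original input, I have to use the re.DOTALL flag so that . can match newlines. If the input has no newlines, the flag is not needed (but it does no harm).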
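
The effect of re.DOTALL can be checked on a small input. A minimal sketch, assuming a made-up two-line sample string; it uses the same lookbehind and lookahead idea described above.

```python
import re

sample = "invoice first line\nsecond line invoice"
pattern = r"(?<=\binvoice ).*?(?= invoice\b)"

# Without the flag, "." cannot cross the newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n" and the match spans both lines.
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```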

See the regex demo

Code:

import re
text= r"""list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of"""
matches = re.findall(r'(?<=\binvoice )(?:.*?)(?= invoice\b)', text, flags=re.DOTALL)
for i, match in enumerate(matches):
    print(f'\nMatch {i + 1}:\n', match, sep='')

Prints:

Match 1:
in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered
Match 2:
parth enterprise â

This problem can be solved more efficiently with a simpler regex that is used to split the input:

import re
text= r"""list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of"""
#matches = re.split(r'\b\s*invoice\s*\b', text)[1:-1] # if arbitrary white space can come before and after "invoice"
matches = re.split(r'\b ?invoice ?\b', text)[1:-1]
for i, match in enumerate(matches):
    print(f'\nMatch {i + 1}:\n', match, sep='')

Prints:

Match 1:
in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered
Match 2:
parth enterprise â
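
If the text after the second invoice should run to the end of the string, as in the question's expected output, one possible variation is to cap the number of splits with the maxsplit argument of re.split. This is a sketch under that assumption; the shortened sample string is made up for illustration.

```python
import re

sample = "check invoice in favour of z invoice parth enterprise invoice no dated 18"

# maxsplit=2 stops splitting after the second "invoice", so the third
# occurrence stays inside the last element instead of cutting it short.
parts = re.split(r"\s*\binvoice\b\s*", sample, maxsplit=2)
after_first, after_second = parts[1], parts[2]

print(after_first)
print(after_second)
```

Here parts[0] is the text before the first invoice, which is why the two results start at index 1.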
