How to ignore an unwanted pattern in a regular expression

I have the following Python code:

import re
from io import BytesIO

import pdfplumber
import requests

# Maps each report URL to the zero-based index of the page to inspect.
test_case = {
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0514/2020051400555.pdf': 59,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0529/2020052902118.pdf': 55,
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0618/2020061800366.pdf': 47,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0630/2020063002674.pdf': 30,
}

for url, page in test_case.items():
    rq = requests.get(url)
    # pdfplumber.open accepts a file-like object; older releases exposed this as pdfplumber.load.
    pdf = pdfplumber.open(BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    txt = re.sub(r"([^\x00-\x7F])+", "", txt)  # strip non-ASCII characters (no Chinese)
    pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    try:
        auditor = re.search(pattern, txt, flags=re.MULTILINE).group('auditor').strip()
        print(repr(auditor))
    except AttributeError:
        # re.search returned None: dump the page text and URL for inspection.
        print(txt)
        print('============')
        print(url)

It produces the following output:

'ShineWing'
'ShineWing'
'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
'Hong Kong Financial Reporting Standards issued by the Hong Kong Institute of'

The expected output is:

'ShineWing'
'ShineWing'
'Ernst & Young'
'Elite Partners CPA Limited'

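I have tried:

pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This pattern captures the last two cases but not the first two.

pattern = r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This produces the desired result, but ^(?!Hong|Kong) is potentially risky: it may exclude other wanted results in the future, so it is not a good candidate.

By contrast, $(?!Institute) looks more general and more suitable, but I don't know why it fails to match in the first two cases. It would be great if there were a way to ignore matches that contain "issued by the Hong Kong Institute of".

Any suggestions would be appreciated. Thank you very much.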
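One possible reason the $(?!Institute) attempt misses the first two cases can be sketched as follows. With re.MULTILINE, $ only matches at the end of a line, so the auditor name must be the last thing on its line. This is an assumption about the first two PDFs, whose extracted text is not shown in the question: if "ShineWing" sits on the same line as the designation, $ cannot match after it. The pattern below is a simplified stand-in for the attempted one, and all three sample strings are hypothetical.

```python
import re

# Simplified stand-in for the attempted pattern: only the parts relevant to `$` are kept.
ANCHORED = re.compile(
    r"(?P<auditor>[A-Z].+?)$(?!Institute)\s*Certified Public Accountants",
    re.MULTILINE,
)


def auditor_or_none(text):
    """Return the captured auditor name, or None when the pattern does not match."""
    match = ANCHORED.search(text)
    return match.group("auditor") if match else None


# Hypothetical samples, not taken from the PDFs in the question.
same_line = "ShineWing Certified Public Accountants"       # name and designation share a line
next_line = "Ernst & Young\nCertified Public Accountants"  # designation starts on the next line
institute = "issued by the Hong Kong Institute of\nCertified Public Accountants"

print(auditor_or_none(same_line))
print(auditor_or_none(next_line))
print(auditor_or_none(institute))
```

In this sketch the same-line sample yields None, because $ can only match at the very end of that string, where no designation follows. The third sample still matches: the character after $ is a newline, never the word "Institute", so a lookahead placed directly after $ cannot reject anything.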

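pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This works. The negative lookahead (?!.*Institute) stops the auditor group from starting at any position that still has "Institute" ahead of it on the same line.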
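The difference between the pattern from the question and the one from this answer can be checked offline with a small sketch. The two sample texts are hypothetical and only imitate the layout that would produce the outputs reported in the question; the real PDF text may differ. The names DESIGNATION, ORIGINAL, FIXED and find_auditor are introduced here for illustration only.

```python
import re

# Shared tail of both patterns: the professional designation after the auditor name.
DESIGNATION = r"((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants"

# Pattern from the question.
ORIGINAL = re.compile(r".*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*" + DESIGNATION)

# Pattern from the answer: the lookahead rejects any start position that still
# has "Institute" ahead of it on the same line.
FIXED = re.compile(r"\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*" + DESIGNATION)


def find_auditor(pattern, text):
    """Return the stripped auditor name, or None when nothing matches."""
    match = pattern.search(text)
    return match.group("auditor").strip() if match else None


# Hypothetical page text in which the institute name precedes the real signature.
report_with_institute = (
    "We conducted our audit\n"
    "in accordance with Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of\n"
    "Certified Public Accountants.\n"
    "Ernst & Young\n"
    "Certified Public Accountants\n"
    "Hong Kong\n"
)

# Hypothetical page text with only the signature block.
report_simple = (
    "Independent auditor's report\n"
    "ShineWing\n"
    "Certified Public Accountants\n"
    "Hong Kong\n"
)

for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    print(name, repr(find_auditor(pattern, report_with_institute)))
    print(name, repr(find_auditor(pattern, report_simple)))
```

On these samples the original pattern returns the "issued by the Hong Kong Institute of" line for the first text, while the fixed pattern skips it and returns 'Ernst & Young'. Both return 'ShineWing' for the second text. One caveat: the lookahead only inspects the rest of the current line, so a genuine auditor line that itself contains the word "Institute" would also be skipped.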
