Regex to match company names in copyright notices under more conditions



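For some time now I have been trying to find a robust regex to extract company names from copyright notices (and I know very little about regex).

In this question: Regex to match company names from copyright statements on a few conditions

I got the regex:

(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)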

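But when I tried more examples, I found that it was not enough. I would like to change it so that it also meets the following conditions while still working for all the previous cases:

  1. Take into account that anything else may appear before "Copyright ©" or "© Copyright" (whichever comes last), and ignore it.

Examples:

602-226-2389 ©2019 Endurance International Group.
Copyright 1999 — 2019 © Iflexion. All rights reserved.

  2. Take into account that there may be no year after "Copyright ©" or "© Copyright", only the company name.

Example:

ISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019

  3. Take into account that the year may come before the word "Copyright" or the "©" symbol (I think condition 1 covers this as well).

Examples:

© 2019 Copyright arcadia.io.
2018 © Power Tools LLC

  4. If there is a |, match up to it and ignore everything after it:

Example:

Copyright 2019 ComputerEase Construction Software | 1-800-544-2530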
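
To make the gaps concrete, here is a minimal sketch that runs the earlier pattern (with its backslashes written out) against a few of the new examples. The helper name `old_extract` is introduced only for this illustration and is not part of the original question.

```python
import re

# The earlier pattern from the linked question, backslashes written out.
OLD_RX = re.compile(
    r"(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)"
)

def old_extract(notice):
    """Return what the earlier pattern captures, or None when nothing matches."""
    m = OLD_RX.search(notice)
    return m.group(1) if m else None

notices = [
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
    "ISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019",
    "2018 © Power Tools LLC",
    "Copyright 2019 ComputerEase Construction Software | 1-800-544-2530",
]
for notice in notices:
    print(repr(old_extract(notice)))
```

The second and third notices yield no match at all, because the pattern requires digits right after the © symbol. The first and last match but capture too much: the dash, second year and © in one case, and everything after the | in the other.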

You may use

(?i)(?:©(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*Copyright)?|Copyright(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*©)?)(?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*(.*?(?=\s*[.|]|\W*All\s+rights\s+reserved)|.*\b)

See the regex demo.

Python code:

import re

# Sample notices, one per line, separated by \r\n
s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved.\r\n602-226-2389 ©2019 Endurance International Group.\r\nCopyright 1999 — 2019 © Iflexion. All rights reserved.\r\nISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019\r\n© 2019 Copyright arcadia.io.\r\n2018 © Power Tools LLC\r\nCopyright 2019 ComputerEase Construction Software | 1-800-544-2530\r\n© 2019 3M. 3M Health Information Systems Privacy Policy"
rx = r'''(?xi)
(?:©                                        # Start of a group: © symbol
  (?:\s*                                    #  Start of optional group: 0+ whitespaces
    (?:\d{4}                                #   Start of optional group: 4 digits
      (?:\s*[-—–]\s*\d{4})?                 #     0+ spaces, a dash, 0+ spaces, 4 digits
    )?                                      #   End of group
    \s*Copyright                            #  Spaces and Copyright
  )?                                        #  End of group
  |                                         #  OR
  Copyright
  (?:\s*                                    #  Start of optional group: 0+ whitespaces
    (?:\d{4}                                #   Start of optional group: 4 digits
      (?:\s*[-—–]\s*\d{4})?                 #     0+ spaces, a dash, 0+ spaces, 4 digits
    )?\s*©                                  #   End of group, 0+ spaces, ©
  )?                                        #  End of group
)                                           # End of group
(?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?          # Optional group: a year, optionally followed by a dash enclosed in whitespace and another year
\s*                                         # 0+ whitespaces
(                                           # Start of a capturing group:
  .*?                                       # any 0+ chars other than linebreak chars, as few as possible, up to...
  (?=\s*[.|]|                               # 0+ spaces and then | or ., or
     \W*All\s+rights\s+reserved)            # All rights reserved with any 0+ non-word chars before it
  |                                         # or
  .*\b                                      # any 0+ chars other than linebreak chars, as many as possible, up to the last word boundary
)'''
for m in re.findall(rx, s):
    print(m)

See the Python demo. Output:

Apple Inc
Quid, Inc
Database Designs
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
Endurance International Group
Iflexion
Mobikasa 2019
arcadia
Power Tools LLC
ComputerEase Construction Software
3M

I believe this regex will cover your needs. Explanation below:

(?i)                                # make the regex case insensitive
(?:Copyright\s*©?|©\s*(Copyright)?) # Look for Copyright and/or © to get us started
([\d\s—-]+)?                        # There might be some digits, spaces, and dashes, but not necessarily
(©|Copyright)?\s*                   # Copyright or © could be separated by dates, so look for them again
(.+?)                               # This is the sugar we're looking for
(?=All rights reserved|\||$)        # If you find "All rights reserved", a | or the end of the string, stop capturing the text
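
As a minimal sketch of how the commented pattern above might be run from Python, the pieces can be joined onto one line. The helper name `company_name` and the trailing `strip()` are assumptions added for illustration, not part of the answer. Note that `re.VERBOSE` is deliberately not used: in verbose mode the literal spaces inside "All rights reserved" would be ignored.

```python
import re

# The commented pattern above, joined onto a single line (no re.VERBOSE,
# because the spaces in "All rights reserved" must stay significant).
RX = re.compile(
    r"(?i)(?:Copyright\s*©?|©\s*(Copyright)?)([\d\s—-]+)?(©|Copyright)?\s*"
    r"(.+?)(?=All rights reserved|\||$)"
)

def company_name(notice):
    """Hypothetical helper: return capture group 4 (the name), or None."""
    m = RX.search(notice)
    # The lazy (.+?) keeps any whitespace before the stop marker, so trim it.
    return m.group(4).strip() if m else None

for notice in [
    "2018 © Power Tools LLC",
    "Copyright 2019 ComputerEase Construction Software | 1-800-544-2530",
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
]:
    print(company_name(notice))
```

Unlike the first answer's pattern, this one does not stop at a period, so the last notice comes back as "Iflexion." with its trailing dot; strip that separately if it matters.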

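I know this is an old question, but I wanted to post a better solution. I trained a spaCy model on 5k+ copyright text samples. Here is the link to the model and a working repository.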
