Regex to extract month and year pairs from dates



I am using a regular expression to extract the month and year from pairs of dates in text:

import re

regex = (
    r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
    r"\s?[.\s’’,/',‘\-–—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[|\-–—])\s?\s?"
    r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
    r"\s?[.\s’’,/',‘\-–—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)

When I test the regex against a few inputs, some of which include a day of the month and some of which do not:

lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex, i, re.IGNORECASE)
    for match in word:
        print(match.group())

I get the following output:

Jan 2008 - May 2012

But my expected output is:

July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012

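What do I need to change so that the regex also matches dates that contain an optional day of the month? When a date string includes a day, it is always an ordinal with a st, nd, rd, or th suffix.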

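You cannot "skip" part of a string within a single match operation, so if you have 26th August, you cannot match or capture just 26 August. In cases like this, you either need to capture the parts you want and then concatenate them, or remove the parts you do not want in a post-processing step.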
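To make the "no skipping" point concrete, here is a minimal sketch (the sample string and pattern are illustrative only): whatever the pattern looks like, the full match is always one contiguous slice of the input, so the ordinal suffix cannot be dropped from the middle of it.

```python
import re

text = "26th August"
m = re.search(r"\d+(?:st|nd|rd|th)\s+August", text)

# The full match is exactly the slice text[m.start():m.end()],
# so the "th" in the middle is necessarily part of it.
whole = m.group(0)
is_contiguous_slice = whole == text[m.start():m.end()]
print(whole)
print(is_contiguous_slice)
```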

So here, I would use the post-processing replacement approach:

import re

# Optional day of the month, with an optional ordinal suffix (st, nd, rd, th)
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
# Two- or four-digit year
year = r'(\d{2}(?:\d{2})?)'
# Matches "<day?> <month> <year> - <day?> <month> <year>"
rx_valid = re.compile(fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE)
# Matches an ordinal day such as " 28th" so it can be removed from the match
rx_ordinal = re.compile(r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE)
lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))

Output:

July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012

See the Python demo and the regex demo.
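The other option mentioned above, capturing the wanted parts and joining them, could be sketched as follows. This is only an illustration under a few assumptions: the helper name `month_year_ranges` is hypothetical, the day is made non-capturing so that only the months and years are captured, and the rebuilt string always uses a single space and " - " as separators, whatever spacing or dash the input had.

```python
import re

# Same building blocks as the pattern above, but the day is non-capturing
# so that only the two months and the two years end up in groups.
DAY = r'(?:(?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?\s*)?'
MONTH = (r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
         r'|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
YEAR = r'(\d{2}(?:\d{2})?)'
RX_RANGE = re.compile(
    fr'\b{DAY}{MONTH}\s*{YEAR}\s*[-—–]\s*{DAY}{MONTH}\s*{YEAR}(?!\d)',
    re.IGNORECASE,
)

def month_year_ranges(text):
    """Rebuild 'Month Year - Month Year' strings from the captured groups."""
    # With four capturing groups, findall() yields one 4-tuple per match.
    return [f"{m1} {y1} - {m2} {y2}" for m1, y1, m2, y2 in RX_RANGE.findall(text)]

for s in ['July 2014 - 28th August 2014', 'Jan 2012 - 3rd sep 2014',
          'Jan 2008 - May 2012', 'Jan 2008 and May 2012']:
    for rng in month_year_ranges(s):
        print(rng)
```

Compared with the replacement approach, this normalizes the separators in the result, which may or may not be what you want.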
