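I am trying to extract dates from news and government announcement text about Covid-19 that I scraped in Hawaii. I ran a sample program on a dummy dataset and found that a date is generated for every number on the page. When I use "strict=True" I get no dates at all. Below are the results for a 4-line file.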
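import datefinder

# Scan the file line by line and print each date match next to its source line
with open("c:/users/Lnitz/documents/ige2.txt") as file:
    for line in file:
        # source=True yields (datetime, matched_text) tuples
        matches = datefinder.find_dates(line, source=True)
        #print(line)
        for match in matches:
            print(match, 'xxx', line)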
Result:
(datetime.datetime(2020, 11, 19, 0, 0), 'on Nov 19, 2020') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases
(datetime.datetime(1998, 10, 24, 0, 0), '98') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases
(datetime.datetime(2021, 10, 14, 0, 0), '14') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases
(datetime.datetime(2021, 10, 19, 0, 0), '19') xxx Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi
(datetime.datetime(1945, 3, 23, 0, 0), '3/23/1945') xxx Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi
(datetime.datetime(1878, 3, 5, 0, 0), 'Mar 5,1878') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing
(datetime.datetime(1972, 10, 24, 0, 0), '72') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing
(datetime.datetime(1978, 10, 24, 0, 0), '78') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing
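The output of datefinder includes the source string if you set source=True, so how about post-processing? For example, for a fully specified date (y/m/d), you need at least 6 characters (including separators) and at least 4 digits: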
import datefinder

s = """Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases
Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi
Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing"""

for l in s.split('\n'):
    matches = datefinder.find_dates(l, strict=False, source=True)
    for m in matches:
        # keep a match only if its source text has at least 4 digits and 6 characters
        if (sum(c.isdigit() for c in m[1]) >= 4) and (len(m[1]) >= 6):
            print(f"{l} ->\n{m}\n")
# Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases ->
# (datetime.datetime(2020, 11, 19, 0, 0), 'on Nov 19, 2020')
# Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi ->
# (datetime.datetime(1945, 3, 23, 0, 0), '3/23/1945')
# Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing ->
# (datetime.datetime(1878, 3, 5, 0, 0), 'Mar 5,1878')
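
If you want to reuse or tune that check, it could be pulled out into a small helper. The following is only a sketch: the names looks_like_full_date and filter_matches and the min_digits / min_length parameters are made up for illustration, and the sample pairs are copied by hand from the output in the question, so the snippet runs without datefinder installed. In practice you would pass in the (datetime, source_text) tuples that find_dates(..., source=True) yields.

```python
from datetime import datetime


def looks_like_full_date(source_text, min_digits=4, min_length=6):
    """Heuristic: the matched text has enough digits and enough characters."""
    digit_count = sum(ch.isdigit() for ch in source_text)
    return digit_count >= min_digits and len(source_text) >= min_length


def filter_matches(matches, min_digits=4, min_length=6):
    """Keep only (datetime, source_text) pairs whose source text passes the check."""
    return [
        (dt, src)
        for dt, src in matches
        if looks_like_full_date(src, min_digits, min_length)
    ]


# Pairs copied from the output in the question (hand-built stand-ins, not a datefinder call)
sample = [
    (datetime(2020, 11, 19), 'on Nov 19, 2020'),
    (datetime(1998, 10, 24), '98'),
    (datetime(2021, 10, 14), '14'),
    (datetime(1945, 3, 23), '3/23/1945'),
    (datetime(1878, 3, 5), 'Mar 5,1878'),
    (datetime(1972, 10, 24), '72'),
]

kept = filter_matches(sample)
for dt, src in kept:
    print(dt.date(), repr(src))
```

Keep in mind this is a heuristic on the matched text only: a bare run of six or more digits would also pass, so you may want to raise the thresholds or add a check for a separator or month name depending on your data.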