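Extract a date range from OCR-extracted data

My regular expression returns a list of items, and I only need the date range from it. The date range is not always at the same index in the list.

I tried converting the list to a string first and then extracting only the date range: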

possible_billing_periods = list(re.findall(r'Billing Period: (.*)|Billing period: (.*)|Billing Period (.*)|Billing period (.*)|period (.*)|period: (.*)', data))
billing_period = str(possible_billing_periods)
for k in billing_period.split("\n"):
    if k != ['(A-Za-Z0-9)']:
        billing_period_2 = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
print(possible_billing_periods)

Output: [('', '', '', 'Tel', ''), ('21-june-2018 - 25-September-2018', '', '')]

Expected result: 21-june-2018 25-September-2018

Actual result: Tel 21 june 2018 25 September 2018

Sample data:
28 August2018 Start Index: B1 0
28 August 2018 Start Index: E1 0
Billing Period: 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018

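Depending on the size of the sample data, a regular expression may not be the best way to retrieve the information (performance-wise).

Assuming the date string you need is always on a line whose first word is 'Billing' or 'period', you could try something like this: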

sample_data = """28 August2018 Start Index: B1 0
28 August 2018 Start Index: E1 0
Billing Period: 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018"""
billing_periods = list()
# First words that mark a billing-period line (only the keys are used)
line_start = {'Billing':0, 'period':0, 'period:':0}
for line in sample_data.split('\n'):
    if line.split()[0] in line_start:
        # The dates are the third-from-last and last tokens on the line
        billing_periods.append((line.split()[-3], line.split()[-1]))
print(billing_periods)

Output:

[('21-june-2018', '25-September-2018')]

The dictionary line_start lets you define several possible words that a matching line may start with.
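Note that this membership test is case-sensitive and fails with an IndexError on a blank line. A minimal sketch of a more forgiving variant follows, assuming the dates are always the third-from-last and last whitespace-separated tokens, as in the sample. The names LINE_STARTS and find_billing_periods are made up for illustration.

```python
# Hypothetical helper names; the token positions (-3 and -1) assume the
# "<from> - <to>" layout seen in the sample data.
LINE_STARTS = {"billing", "period", "period:"}


def find_billing_periods(text):
    periods = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines and lines too short to hold "<from> - <to>"
        if len(tokens) < 3:
            continue
        # Compare the first word case-insensitively
        if tokens[0].lower() in LINE_STARTS:
            periods.append((tokens[-3], tokens[-1]))
    return periods


sample = """28 August 2018 Start Index: E1 0
billing period: 21-june-2018 - 25-September-2018

Expected next reading: 25 December 2018"""
print(find_billing_periods(sample))
# [('21-june-2018', '25-September-2018')]
```

A set is used here instead of a dictionary because only membership matters.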

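I'm guessing the data comes from a file, so processing it line by line is the simplest approach. Here is pseudocode for a common way of processing a file: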

for each line in the file:
    if it is a line we care about:
        process the line

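Judging from the sample data, the lines we care about start with some variant of "Billing Period:". Below is a regular expression that finds lines starting with any of the variants in your example code. The leading (?x) is equivalent to the re.VERBOSE flag. It tells the regex compiler to ignore whitespace, so I can spread out the parts of the regex and explain what is going on with comments.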

import re

billing_period_re = re.compile(r"""(?xi)   # ignorecase and verbose (inline flags must come first)
   ^                # match at the beginning of the string
   \s*
   (?:Billing)?     # optional Billing. (?: ...) means don't save the group
   \s*
   Period
   \s*
   :?               # optional colon
   \s*
   """)

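Now, if the billing period regex matches, we need to find a date range. Based on the sample data, a date range is two dates separated by " - ". A date is a 1-2 digit day, a month name, and a 4-digit year, joined by "-". Here is one way to build a regex for the date range: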

day   = r"\d{1,2}"
month = r"(?:january|february|march|april|may|june|july|august|september|october|november|december)"
year  = r"\d{4}"
date = rf"{day}-{month}-{year}"
date_range_re = re.compile(rf"(?i)(?P<from>{date}) - (?P<to>{date})")
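If you need real dates rather than strings, the two named groups can be passed to datetime.strptime. This is only a sketch: it assumes the OCR output uses full English month names and that the program runs under an English (or C) locale, because %B is locale-dependent.

```python
import re
from datetime import datetime

day = r"\d{1,2}"
month = r"(?:january|february|march|april|may|june|july|august|september|october|november|december)"
year = r"\d{4}"
date = rf"{day}-{month}-{year}"
date_range_re = re.compile(rf"(?i)(?P<from>{date}) - (?P<to>{date})")

line = "Billing Period: 21-june-2018 - 25-September-2018"
m = date_range_re.search(line)

# %d = day of month, %B = full month name (matched case-insensitively),
# %Y = 4-digit year. Assumes an English or C locale.
start = datetime.strptime(m["from"], "%d-%B-%Y").date()
end = datetime.strptime(m["to"], "%d-%B-%Y").date()

print(start, end, (end - start).days)
# 2018-06-21 2018-09-25 96
```

Converting to date objects also makes it easy to sanity-check OCR output, for example by rejecting a range whose end is before its start.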

Putting it all together:

# this could be for line in input_file:
for line in data.splitlines():
    # check if it's a billing period line
    linematch = billing_period_re.search(line)
    if linematch:
        # check if there is a date range
        date_range = date_range_re.search(line, linematch.end())
        if date_range:
            print(f"from: {date_range['from']} to: {date_range['to']}")
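For reference, here is the same approach as one self-contained script that prints the dates in the format asked for in the question. The function name extract_billing_periods is made up for illustration, and the patterns are compact equivalents of the ones above rather than anything canonical.

```python
import re

# Line prefix: optional "Billing", then "Period", then an optional colon
BILLING_PERIOD_RE = re.compile(r"(?i)^\s*(?:billing)?\s*period\s*:?\s*")

MONTH = (r"(?:january|february|march|april|may|june|july|august"
         r"|september|october|november|december)")
# Doubled braces produce literal { } inside an f-string
DATE = rf"\d{{1,2}}-{MONTH}-\d{{4}}"
DATE_RANGE_RE = re.compile(rf"(?i)(?P<from>{DATE}) - (?P<to>{DATE})")


def extract_billing_periods(text):
    """Return (from, to) string pairs for every billing-period line."""
    periods = []
    for line in text.splitlines():
        linematch = BILLING_PERIOD_RE.search(line)
        if not linematch:
            continue
        # Only look for the date range after the matched prefix
        date_range = DATE_RANGE_RE.search(line, linematch.end())
        if date_range:
            periods.append((date_range["from"], date_range["to"]))
    return periods


data = """28 August2018 Start Index: B1 0
28 August 2018 Start Index: E1 0
Billing Period: 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018"""

for start, end in extract_billing_periods(data):
    print(start, end)
# 21-june-2018 25-September-2018
```

Returning a list of tuples keeps the extraction separate from the printing, so stray matches such as 'Tel' never reach the output.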
