Python multiline regex



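I am using pdfplumber's page.extract_text() to extract text from a bank statement. The text appears to be extracted correctly, but I am having trouble using a regular expression to pull out the date, type, description, and amounts. I can't come up with a clean way to capture multi-line descriptions. I want to group the description text in the gold boxes (the continuation lines) with the description text on the line before each gold box.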

Regex pattern

re.findall(r'(\d{2}/\d{2})\s*([\w ]*)([$\d.,]*)(\s{2})([$\d.,]*).*\s(?=\w*)', text)

Regex explanation

(\d{2}/\d{2}) - Capture date
([\w ]*) - Capture description
([$\d.,]*) - Capture expense amount
([$\d.,]*) - Capture deposit amount
(?=\w*) - Positive lookahead for any text below

Input

0  0  $12,345.67 
08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  $1.23  0  $123,456.78
RACETRAC467 00004671 PLEASANTVILLEPA
08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32
TO SMITH,JOHN
SAVINGS #0001, CONF# 8675309
continued on next page>>>
987654-3210
Page 1 of 11

Current output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  ', '$1.23', '  ', '0', '  $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB  ', '$45.67', '  ', '0', '  $124,816.32 ']

Desired output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA  ', '$1.23', '  ', '0', '  $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309 ', '$45.67', '  ', '0', ' $124,816.32 ']

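You can append the text of the following lines to the existing description, as long as those lines do not start with a date, with "continued", or with "Page" followed by a number.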

In your pattern you use [\w ]*, but that can also match spaces only. If there should be at least one word character, you can use \w[\w ]*
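A minimal sketch of that difference, using re.fullmatch on a string of spaces only. The variable names are illustrative and not part of the original answer.

```python
import re

loose = r"[\w ]*"     # the description class from the question
strict = r"\w[\w ]*"  # requires a leading word character

only_spaces = "   "
loose_matches_spaces = re.fullmatch(loose, only_spaces) is not None
strict_matches_spaces = re.fullmatch(strict, only_spaces) is not None
strict_matches_text = re.fullmatch(strict, "BANK FUNDS TRANSFER DB  ") is not None

print(loose_matches_spaces, strict_matches_spaces, strict_matches_text)
# prints: True False True
```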

You can also drop the capture group around (\s{2}), because it only returns an entry containing spaces.

(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)

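The pattern matches:

  • (?P<date>\d{2}/\d{2}) Group date: two digits, a slash, two digits
  • \s+ Match 1+ whitespace characters
  • (?P<desc>\w[\w ]*) Group desc: a word character followed by optional word characters and spaces
  • (?P<expense>\$[\d.,]*) Group expense: $ followed by optional digits, dots, and commas
  • \s{2} Match 2 whitespace characters
  • (?P<deposit>\d[\d.,]*) Group deposit: a digit followed by optional digits, dots, and commas
  • \s.* Match a single whitespace character and the rest of the line
  • (?P<desc_more> Group desc_more
    • (?: Non-capturing group to match as a whole
      • \n(?!\d+/\d|continued\b|Page\s+\d).* Match a newline, assert that what follows is not a date-like pattern or one of the other alternatives, then match the rest of that line
    • )* Close the non-capturing group and optionally repeat it
  • ) Close group desc_more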
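The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is a sketch of the same pattern in a different layout; the names ROW and sample are illustrative.

```python
import re

ROW = re.compile(r"""
    (?P<date>\d{2}/\d{2})      # date such as 08/27
    \s+                        # one or more whitespace characters
    (?P<desc>\w[\w ]*)         # description: word characters and spaces
    (?P<expense>\$[\d.,]*)     # expense: $ then optional digits, dots, commas
    \s{2}                      # two whitespace characters
    (?P<deposit>\d[\d.,]*)     # deposit: a digit, then optional digits, dots, commas
    \s.*                       # one whitespace character, then the rest of the line
    (?P<desc_more>             # continuation lines of the description
        (?:
            \n                                # a newline
            (?!\d+/\d|continued\b|Page\s+\d)  # next line is not a date, 'continued', or 'Page <n>'
            .*                                # the rest of that line
        )*
    )
""", re.VERBOSE)

sample = ("08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32\n"
          "TO SMITH,JOHN\n"
          "SAVINGS #0001, CONF# 8675309\n"
          "continued on next page>>>\n")
m = ROW.search(sample)
print(m.groupdict())
```

The space inside [\w ] is kept because verbose mode does not strip whitespace inside a character class.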

See a regex demo and a Python demo.

Example using named capture groups and match.groupdict():

import re

# Named groups capture the dated row; desc_more collects the continuation lines.
pattern = r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)"

s = ("  0  0  $12,345.67 \n"
     "08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  $1.23  0  $123,456.78\n"
     "RACETRAC467 00004671 PLEASANTVILLEPA\n"
     "08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32\n"
     "TO SMITH,JOHN\n"
     "SAVINGS #0001, CONF# 8675309\n"
     "continued on next page>>>\n"
     " 987654-3210\n"
     "Page 1 of 11\n"
     "07/27  DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  $2.23  0  $223,456.78")

for match in re.finditer(pattern, s):
    d = match.groupdict()
    # Join desc and desc_more, replacing each newline (and any spaces before it) with one space.
    d["desc"] = re.sub(r"[^\S\n]*\n", " ", d["desc"] + d["desc_more"])
    del d["desc_more"]
    print(d)

Output

{'date': '08/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA', 'expense': '$1.23', 'deposit': '0'}
{'date': '08/27', 'desc': 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309', 'expense': '$45.67', 'deposit': '0'}
{'date': '07/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  ', 'expense': '$2.23', 'deposit': '0'}
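The same grouping rule (append any line that does not start with a date, "continued", or "Page" plus a number) can also be applied line by line instead of in a single pattern. The following is a rough sketch of that alternative, not the method used above; parse_rows, ROW, and STOP are hypothetical names. Unlike the single-pattern version, it trims trailing spaces from desc and treats a blank line as the end of a description.

```python
import re

# First line of a transaction: date, description, $expense, deposit.
ROW = re.compile(
    r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)"
    r"(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s"
)
# Line starts that must not be treated as a description continuation.
STOP = re.compile(r"\d+/\d|continued\b|Page\s+\d")

def parse_rows(text):
    rows = []
    open_row = None  # the row that is still accepting continuation lines
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            open_row = m.groupdict()
            open_row["desc"] = open_row["desc"].strip()
            rows.append(open_row)
        elif open_row is not None and line.strip() and not STOP.match(line):
            # Continuation line: append it to the current description.
            open_row["desc"] += " " + line.strip()
        else:
            # Blank line, stop line, or text before the first row.
            open_row = None
    return rows

text = ("08/27  BANK FUNDS TRANSFER DB  $45.67  0  $124,816.32\n"
        "TO SMITH,JOHN\n"
        "SAVINGS #0001, CONF# 8675309\n"
        "continued on next page>>>\n"
        " 987654-3210\n"
        "Page 1 of 11\n")
for row in parse_rows(text):
    print(row)
```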
