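I am using pdfplumber's page.extract_text() to extract text from a bank statement. The text seems to be extracted correctly, but I'm having trouble using a regex to pull out the date, type, description, and amounts. I can't think of a clean way to capture the multi-line descriptions. I'd like to group the description text in the gold boxes with the description text on the line preceding each gold box.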
Regex pattern
re.findall(r'(\d{2}/\d{2})\s*([\w ]*)([$\d.,]*)(\s{2})([$\d.,]*).*\s(?=\w*)', text)
Regex explanation
(\d{2}/\d{2}) - Capture date
([\w ]*) - Capture description
([$\d.,]*) - Capture expense amount
([$\d.,]*) - Capture deposit amount
(?=\w*) - Positive lookahead for any text below
Input
0 0 $12,345.67
08/27 DEBIT CARD PURCHASE XXXXXX 5541XXXXXX $1.23 0 $123,456.78
RACETRAC467 00004671 PLEASANTVILLEPA
08/27 BANK FUNDS TRANSFER DB $45.67 0 $124,816.32
TO SMITH,JOHN
SAVINGS #0001, CONF# 8675309
continued on next page>>>
987654-3210
Page 1 of 11
Current output
['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX ', '$1.23', ' ', '0', ' $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB ', '$45.67', ' ', '0', ' $124,816.32 ']
Desired output
['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA ', '$1.23', ' ', '0', ' $123,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309 ', '$45.67', ' ', '0', ' $124,816.32 ']
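You can append the description from the following lines (that is, lines that do not start with a date, with "continued", or with Page followed by a number) to the existing description.
In your pattern you use [\w ]*, but that can also match spaces only. If there should be at least one word character, you can use \w[\w ]* instead.
You can also drop the capture group around (\s{2}), as it only returns an entry containing whitespace.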
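A minimal sketch of both points, using made-up sample strings (the variable names are only for illustration):

```python
import re

# [\w ]* is satisfied by spaces alone, so a blank run counts as a "description"
loose = re.fullmatch(r"[\w ]*", "   ")
# \w[\w ]* requires a word character first, so the same blank run is rejected
strict = re.fullmatch(r"\w[\w ]*", "   ")
print(loose is not None, strict is not None)

# Capturing (\s{2}) adds a whitespace-only entry to every findall() tuple
with_group = re.findall(r"(\$[\d.,]*)(\s{2})(\d+)", "$1.23  0")
without_group = re.findall(r"(\$[\d.,]*)\s{2}(\d+)", "$1.23  0")
print(with_group)
print(without_group)
```

The second findall() call returns the same amounts without the extra whitespace entry.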
(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)
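The pattern matches:
(?P<date>\d{2}/\d{2}) - Group date
\s+ - Match 1+ whitespace chars
(?P<desc>\w[\w ]*) - Group desc, match a word char followed by optional word chars and spaces
(?P<expense>\$[\d.,]*) - Group expense, match $ followed by optional digits, . or ,
\s{2} - Match 2 whitespace chars
(?P<deposit>\d[\d.,]*) - Group deposit, match a digit followed by optional digits, . or ,
\s.* - Match a single whitespace char and the rest of the line
(?P<desc_more> - Group desc_more
(?: - Non-capture group, to match as a whole
\n(?!\d+/\d|continued\b|Page\s+\d).* - Match a newline and, if the next line does not start with a date-like pattern or any of the other alternatives, the rest of that line
)* - Close the non-capture group and optionally repeat it
) - Close group desc_more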
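To make the negative lookahead easier to follow, here is a small sketch that applies the same three alternatives to individual lines. The names `not_continuation` and `lines` are only for illustration, and `re.match` stands in for the position right after `\n`:

```python
import re

# The alternatives from the lookahead; re.match anchors them at the line start
not_continuation = re.compile(r"\d+/\d|continued\b|Page\s+\d")

lines = [
    "RACETRAC467 00004671 PLEASANTVILLEPA",
    "08/27 BANK FUNDS TRANSFER DB $45.67  0  $124,816.32",
    "TO SMITH,JOHN",
    "continued on next page>>>",
    " 987654-3210",
    "Page 1 of 11",
]

# Lines that the lookahead would allow as part of desc_more
continuation = [line for line in lines if not not_continuation.match(line)]
print(continuation)
```

Note that " 987654-3210" passes the lookahead on its own. In the sample text it is still left out of desc_more, because the repetition already stopped at the "continued" line before it.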
See a regex demo and a Python demo.
Example using named capture groups and match.groupdict():
import re

pattern = r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)"

# Sample text; expense and deposit are separated by two spaces, as \s{2} requires
s = (" 0 0 $12,345.67 \n"
     "08/27 DEBIT CARD PURCHASE XXXXXX 5541XXXXXX $1.23  0 $123,456.78\n"
     "RACETRAC467 00004671 PLEASANTVILLEPA\n"
     "08/27 BANK FUNDS TRANSFER DB $45.67  0 $124,816.32\n"
     "TO SMITH,JOHN\n"
     "SAVINGS #0001, CONF# 8675309\n"
     "continued on next page>>>\n"
     " 987654-3210\n"
     "Page 1 of 11\n"
     "07/27 DEBIT CARD PURCHASE XXXXXX 6541XXXXXX $2.23  0 $223,456.78")

for match in re.finditer(pattern, s):
    d = match.groupdict()
    # Join desc and desc_more; each newline (plus any spaces before it) becomes one space
    d["desc"] = re.sub(r"[^\S\n]*\n", " ", d["desc"] + d["desc_more"])
    del d["desc_more"]
    print(d)
Output
{'date': '08/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA', 'expense': '$1.23', 'deposit': '0'}
{'date': '08/27', 'desc': 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309', 'expense': '$45.67', 'deposit': '0'}
{'date': '07/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 6541XXXXXX ', 'expense': '$2.23', 'deposit': '0'}
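If a single large regex feels hard to maintain, the same idea (append every line that does not start with a date, "continued", or Page plus a number) can also be sketched line by line. This is an alternative to the pattern above, not a replacement for it. The names `ROW`, `STOP`, and `parse_rows` are hypothetical, and the sketch assumes two spaces between the expense and deposit columns. Unlike the regex version, it strips the description and skips blank lines.

```python
import re

# First-line part of the pattern above (hypothetical helper names)
ROW = re.compile(
    r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)"
    r"(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s"
)
# Line starts that end a description instead of continuing it
STOP = re.compile(r"\d+/\d|continued\b|Page\s+\d")


def parse_rows(text):
    rows = []
    current = None  # the row that continuation lines are appended to
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            current = m.groupdict()
            current["desc"] = current["desc"].strip()
            rows.append(current)
        elif STOP.match(line):
            current = None  # footer or date-like line: stop appending
        elif current is not None and line.strip():
            current["desc"] += " " + line.strip()
    return rows


sample = (
    " 0 0 $12,345.67 \n"
    "08/27 DEBIT CARD PURCHASE XXXXXX 5541XXXXXX $1.23  0 $123,456.78\n"
    "RACETRAC467 00004671 PLEASANTVILLEPA\n"
    "08/27 BANK FUNDS TRANSFER DB $45.67  0 $124,816.32\n"
    "TO SMITH,JOHN\n"
    "SAVINGS #0001, CONF# 8675309\n"
    "continued on next page>>>\n"
    " 987654-3210\n"
    "Page 1 of 11\n"
    "07/27 DEBIT CARD PURCHASE XXXXXX 6541XXXXXX $2.23  0 $223,456.78"
)
rows = parse_rows(sample)
for row in rows:
    print(row)
```

Keeping the row pattern and the stop conditions separate makes it easier to add further footer markers later without touching the row regex.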