A faster way to extract text from a PDF file



Team,

I have a PDF file of roughly 6,000+ pages. What is the fastest way I can extract the text?

I am using this code:

all_text = ""
with pdfplumber.open(pdf_dir) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        all_text += text

but it takes a long time to complete.

Also, after extracting I need to search for the address codes; I am using:

address_line = re.compile(r'(: \d{5})')
for line in text.split('\n'):
    if address_line.search(line):
        print(line)

Thanks in advance for your help :)

Since there is no need to keep the whole text in memory, just iterate over the lines of each page and collect the matching lines:

with pdfplumber.open(pdf_dir) as pdf:
    matched_lines = []
    address_line = re.compile(r'(: \d{5})')
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            if address_line.search(line):
                matched_lines.append(line)
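The line-filtering step above can be exercised on its own with plain strings, which makes it easy to check the pattern before running it over 6,000 pages. This is a minimal sketch using hypothetical sample text in place of one page's extract_text() output; it assumes the intended pattern is a five-digit code following ': '.

```python
import re

# Hypothetical stand-in for one page's extract_text() result.
page_text = "Name: Alice\nAddress: 12345\nCity: Springfield\nAddress: 6789"

address_line = re.compile(r'(: \d{5})')
matched_lines = [line for line in page_text.split('\n')
                 if address_line.search(line)]
print(matched_lines)  # only the line with a five-digit code matches
```

Note that "Address: 6789" is skipped because \d{5} requires five consecutive digits.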

You may find multiprocessing more efficient. Here is an example of how to do that:

import pdfplumber
from re import compile
from sys import stderr
from concurrent.futures import ProcessPoolExecutor as PPE
from functools import partial

FILENAME = 'Maki.pdf'
PATTERN = compile(r'(: \d{5})')

# return a list of all lines that contain a match of the regular expression
def extract(filename, page):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            for line in pdf.pages[page].extract_text().split('\n'):
                if PATTERN.search(line):
                    result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main(filename):
    with PPE() as ppe, pdfplumber.open(filename) as pdf:
        for result in ppe.map(partial(extract, filename), range(len(pdf.pages))):
            print(result)

if __name__ == '__main__':
    main(FILENAME)

Note:

Rewritten; serialization had to be avoided, so each worker re-opens the PDF itself rather than receiving page objects from the parent process.
