Team,
I have a PDF with roughly 6,000+ pages. What is the fastest way to extract its text?
I am using this code:

all_text = ""
with pdfplumber.open(pdf_dir) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        all_text += text

but it takes a very long time to finish.
Also, after extraction I need to search for address codes, for which I am using:

address_line = re.compile(r'(: \d{5})')
for line in text.split('\n'):
    if address_line.search(line):
        print(line)

Thanks in advance for your help :)
Since there is no need to keep the whole text in memory, just iterate over the lines of each page and collect the matching lines:

with pdfplumber.open(pdf_dir) as pdf:
    matched_lines = []
    address_line = re.compile(r'(: \d{5})')
    for page in pdf.pages:
        # extract_text() can return None for pages with no text layer
        text = page.extract_text() or ''
        for line in text.split('\n'):
            if address_line.search(line):
                matched_lines.append(line)
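The line-filtering step itself is independent of pdfplumber, so it can be sanity-checked on plain strings first. A minimal sketch — the sample lines below are hypothetical, and the `: \d{5}` pattern (colon, space, five digits) is an assumption about what the address codes look like:

```python
import re

# Assumed address-code pattern: a colon, a space, then exactly five digits
address_line = re.compile(r'(: \d{5})')

# Hypothetical sample of extracted page text
sample_text = "Name: Alice\nAddress code: 12345\nPhone: 555-0100\nAddress code: 98765"

# Keep only the lines containing a match
matched_lines = [line for line in sample_text.split('\n') if address_line.search(line)]
print(matched_lines)  # → ['Address code: 12345', 'Address code: 98765']
```

Note that "Phone: 555-0100" does not match, because the pattern requires five consecutive digits directly after the colon and space.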
You may find multiprocessing more efficient. Here is an example of how to do that:
import pdfplumber
from re import compile
from sys import stderr
from concurrent.futures import ProcessPoolExecutor as PPE
from functools import partial

FILENAME = 'Maki.pdf'
PATTERN = compile(r'(: \d{5})')

# return a list of all lines on the given page that match the regular expression
def extract(filename, page):
    result = []
    try:
        with pdfplumber.open(filename) as pdf:
            # extract_text() can return None for pages with no text layer
            text = pdf.pages[page].extract_text() or ''
            for line in text.split('\n'):
                if PATTERN.search(line):
                    result.append(line)
    except Exception as e:
        print(e, file=stderr)
    return result

def main(filename):
    with PPE() as ppe, pdfplumber.open(filename) as pdf:
        # ppe.map yields each worker's result list in page order
        for matches in ppe.map(partial(extract, filename), range(len(pdf.pages))):
            print(matches)

if __name__ == '__main__':
    main(FILENAME)
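The `partial(extract, filename)` call pre-binds the filename so that `map` only has to supply the page numbers. A toy illustration with a hypothetical stand-in function (no PDF involved):

```python
from functools import partial

# Hypothetical stand-in for extract(filename, page): just combines its two arguments
def extract(filename, page):
    return f"{filename}:{page}"

# Pre-bind the first argument; map then supplies only the page numbers
bound = partial(extract, 'Maki.pdf')
results = list(map(bound, range(3)))
print(results)  # → ['Maki.pdf:0', 'Maki.pdf:1', 'Maki.pdf:2']
```

For thousands of pages, it may also be worth passing a `chunksize` argument to `ProcessPoolExecutor.map` (e.g. `ppe.map(..., chunksize=32)`) to reduce inter-process communication overhead.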
Note: the file is reopened by name inside each worker, rather than passing page objects from the parent, to avoid serializing the pdfplumber objects between processes.