Parsing multiple PDFs into a DataFrame

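How can I copy the entire content of each PDF (there are multiple PDFs in one folder) into a single cell (say, in column B), with the file name in column A? Right now this code parses all the PDFs, but every line of a PDF ends up in its own row of the DataFrame. I need each PDF to be one row.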

from pathlib import Path
import fitz  # PyMuPDF
import pandas as pd

# Collect the paths of all files with a .pdf extension in the given directory
fold = "C:/Users/talen/OneDrive/Application Development/data/ForParse/"
pdf_search = Path(fold).glob("*.pdf")
# Convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]
pdf_txt = ""
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            pdf_txt += page.get_text()  # spelled getText() in older PyMuPDF releases

with open('pdf_txt.txt', 'w', encoding='utf-8') as f:  # Converting to text file
    f.write(pdf_txt)
data = pd.read_table('pdf_txt.txt', lineterminator='\n')  # Converting text file to dataframe
print(data)

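I also tried sep="\n", which gave me an error: ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.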

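First of all, you do not need to convert the PDF files to a text file. Instead, you can put the text of each PDF directly into any cell of the DataFrame.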

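Create an empty list, textStr = [], and store the text of the PDF in it with textStr.append(page.get_text("text").replace('\n', ' ')). For this you need to loop over the pages of the PDF file.
Join the items of the list textStr to form a single string: Text = ' '.join(textStr)
Now place the string Text wherever you want in the DataFrame. For example, df.at[1, 'B'] = Text puts it in the second row (index 1) of column B.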
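
The steps above can be put together roughly as follows. This is only a minimal sketch, not tested against your files: the helper names join_pages and pdfs_to_dataframe and the column labels "A" and "B" are my own choices for illustration, and the folder path in the usage comment is a placeholder. It builds a list with one entry per PDF and creates the DataFrame once at the end, instead of assigning cells one at a time with df.at.

```python
from pathlib import Path


def join_pages(page_texts):
    """Collapse the text of all pages of one PDF into a single string."""
    # Replace newlines inside each page so the whole PDF fits in one cell.
    return " ".join(text.replace("\n", " ").strip() for text in page_texts)


def pdfs_to_dataframe(folder):
    """Return a DataFrame with one row per PDF: file name in A, text in B."""
    # Imported here so join_pages() can be tried without these packages.
    import fitz  # PyMuPDF
    import pandas as pd

    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with fitz.open(str(pdf_path)) as doc:
            # The generator is consumed inside the with-block,
            # while the document is still open.
            text = join_pages(page.get_text("text") for page in doc)
        rows.append({"A": pdf_path.name, "B": text})
    return pd.DataFrame(rows, columns=["A", "B"])


# Usage (placeholder path, replace with your own folder):
# df = pdfs_to_dataframe("path/to/your/pdf/folder")
# print(df)
```

Because each PDF contributes exactly one dictionary to rows, the resulting DataFrame has one row per file regardless of how many lines or pages the PDF contains.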
