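I have a folder with 100 PDF files. The first page of every PDF contains a table that I am extracting. I then concatenate all the tables into a single DataFrame and write it to a CSV file. However, I get an error during the concatenation.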
import os
import camelot
import pandas as pd
import PyPDF2
import tabula

# Set the directory path where the PDF files are located
dir_path = "my/path/"

# Create an empty list to store the tables
tables = []

# Loop through each file in the directory
for filename in os.listdir(dir_path):
    # Check if the file is a PDF file
    if filename.endswith(".pdf"):
        # Open the PDF file
        with open(os.path.join(dir_path, filename), "rb") as pdf_file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfFileReader(pdf_file)
            # Get the first page of the PDF file
            page = pdf_reader.getPage(0)
            # Extract the table from the first page using tabula-py
            table = tabula.read_pdf(pdf_file, pages=1, pandas_options={"header": True})
            print(table)
            # Append the table to the tables list
            tables.append(table)

# Concatenate all tables into a single DataFrame
df = pd.concat(tables)

# Write the DataFrame to a CSV file
df.to_csv("Output.csv", index=False)
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
tabula.read_pdf returns a list of DataFrames, so in your code tables is a list of lists of DataFrames.
For pd.concat to work, you first have to flatten tables, like this:
df = pd.concat([table for sub_list in tables for table in sub_list])
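To see the flattening step on its own, here is a minimal sketch that needs only pandas. The toy DataFrames stand in for what tabula.read_pdf would return for two PDFs, and ignore_index=True is my own addition to renumber the rows; drop it if you want to keep each table's original index.

```python
import pandas as pd

# Stand-ins for tabula.read_pdf results: one list of DataFrames per PDF.
tables = [
    [pd.DataFrame({"a": [1, 2]})],
    [pd.DataFrame({"a": [3]}), pd.DataFrame({"a": [4]})],
]

# pd.concat(tables) would raise the TypeError from the question,
# because each element of tables is a list, not a DataFrame.

# Flatten the list of lists into a flat list of DataFrames.
flat = [table for sub_list in tables for table in sub_list]

# Concatenate; ignore_index=True (an assumption, not in the original) renumbers rows.
df = pd.concat(flat, ignore_index=True)
print(df)
```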
Also, don't forget to set output_format="dataframe" when calling tabula.read_pdf.
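Putting both points together, one way the loop could look is sketched below. This is not the only fix: it avoids the nested list in the first place by using list.extend instead of append, and it passes the file path straight to tabula.read_pdf. The helper names (read_first_page, collect_tables, tables_to_csv) and the injectable reader argument are hypothetical, introduced here only so the logic can be tried without tabula or Java installed.

```python
import os
import pandas as pd


def read_first_page(path):
    """Return the list of DataFrames tabula-py finds on page 1 of the PDF at path."""
    import tabula  # imported here so the helpers below work without tabula installed
    return tabula.read_pdf(path, pages=1, output_format="dataframe")


def collect_tables(dir_path, reader=read_first_page):
    """Collect the DataFrames from every PDF in dir_path into one flat list."""
    tables = []
    for filename in sorted(os.listdir(dir_path)):
        if filename.lower().endswith(".pdf"):
            # extend (not append) keeps tables a flat list of DataFrames
            tables.extend(reader(os.path.join(dir_path, filename)))
    return tables


def tables_to_csv(dir_path, out_path, reader=read_first_page):
    """Concatenate all extracted tables and write them to out_path as CSV."""
    tables = collect_tables(dir_path, reader)
    if not tables:
        raise ValueError("no tables were extracted from " + dir_path)
    df = pd.concat(tables, ignore_index=True)
    df.to_csv(out_path, index=False)
    return df
```

With tabula-py installed, tables_to_csv("my/path/", "Output.csv") would cover the case from the question. Note that pd.concat aligns on column names, so tables with different headers will produce NaN-filled columns.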