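I have a PDF file whose first page has a different data layout, while the tables on all the other pages share the same format. I want to convert this multi-page PDF to a CSV file using Python and Tabula.
My current code converts the PDF to CSV as long as the PDF has exactly 2 pages; if it has more than two pages, I get an out-of-range error.
I want to count the total number of pages in the PDF and, based on that count, have the Python script convert the DataFrame from each page to CSV.
I am running this Python script on a Linux box.
The code is shown below: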
#!/usr/bin/env python3
import tabula
import pandas as pd
import csv

pdf_file = '/root/scripts/pdf2xls/Test/21KJAZP011.pdf'
column_names = ['Product', 'Batch No', 'Machin No', 'Time', 'Date', 'Drum/Bag No', 'Tare Wt.kg', 'Gross Wt.kg',
                'Net Wt.kg', 'Blender', 'Remarks', 'Operator']
df_results = []  # store results in a list
# Page 1 processing
try:
    df1 = tabula.read_pdf(pdf_file, pages=1, area=(95, 20, 800, 840),  # (top, left, bottom, right)
                          columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
                          pandas_options={'header': None})
    df1[0] = df1[0].drop(columns=5)
    df1[0].columns = column_names
    df_results.append(df1[0])
    df1[0].head(2)
except Exception as e:
    print(f"Exception page not found {e}")
# Page 2 processing
try:
    df2 = tabula.read_pdf(pdf_file, pages=2, area=(10, 20, 800, 840),  # (top, left, bottom, right)
                          columns=[93, 180, 220, 252, 310, 315, 330, 370, 410, 450, 480, 520],
                          pandas_options={'header': None})
    row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
    df2[0] = df2[0].iloc[:row_with_Sta]
    df2[0] = df2[0].drop(columns=5)
    df2[0].columns = column_names
    df_results.append(df2[0])
    df2[0].head(2)
except Exception as e:
    print(f"Exception page not found {e}")
# result = pd.concat([df1[0], df2[0], df3[0]])  # concatenate the pages and then write to CSV
result = pd.concat(df_results)  # concatenate the list of pages and then write to CSV
result.to_csv("result.csv")
with open('/root/scripts/pdf2xls/Test/result.csv', 'r') as f_input, open('/root/scripts/pdf2xls/Test/FinalOutput_21KJAZP011.csv', 'w') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    csv_output.writerow(next(csv_input))  # write header
    for cols in csv_input:
        for i in range(7, 9):
            cols[i] = '{:.2f}'.format(float(cols[i]))
        csv_output.writerow(cols)
Please suggest how to achieve this. I am very new to Python, so I am unable to put the pieces together.
Try pdfplumber (https://github.com/jsvine/pdfplumber); it worked like a charm for me.
import pdfplumber

pdffile = 'your file'
with pdfplumber.open(pdffile) as pdf:
    for page in pdf.pages:  # len(pdf.pages) is the total page count
        rawdata = page.extract_table()  # rows of the page's table, or None if none is found
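Building on the snippet above, here is a minimal sketch of how the per-page tables could be cleaned and written to a single CSV. The helper names `clean_rows` and `pdf_to_csv` are made up for illustration, and the `'Sta'` footer marker and the dropped column index 5 are assumptions carried over from the question's tabula code; pdfplumber may split the columns differently, so check the raw output of `extract_table()` first.

```python
import csv


def clean_rows(rows, footer_marker="Sta", drop_index=5):
    """Keep the rows above the footer row and drop one unwanted column.

    `rows` is a list of lists, the shape that extract_table() returns.
    """
    cleaned = []
    for row in rows:
        first_cell = (row[0] or "").strip() if row else ""
        if first_cell == footer_marker:
            break  # the footer row and everything after it is not table data
        cleaned.append([cell for i, cell in enumerate(row) if i != drop_index])
    return cleaned


def pdf_to_csv(pdf_path, csv_path, header):
    """Write the cleaned table rows of every page to one CSV file."""
    import pdfplumber  # imported here so clean_rows can be used without it

    with pdfplumber.open(pdf_path) as pdf, open(csv_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        print(f"{len(pdf.pages)} pages found")
        for page in pdf.pages:
            table = page.extract_table()  # None when no table is detected
            if table:
                writer.writerows(clean_rows(table))
```

Because the loop runs over `pdf.pages`, it does not matter whether the file has 2 pages or 20, and a page without the footer marker is simply kept in full instead of raising an error.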
Use tabula's multiple_tables option to extract multiple tables from the PDF:
multiple_tables=True
from tabula import convert_into

table_file = r"PDF_path"
output_csv = r"out_csv"
# convert_into writes the CSV to output_csv and returns None, so there is nothing to assign
convert_into(table_file, output_csv, output_format='csv', lattice=False, stream=True, multiple_tables=True, pages="all")
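If you need to clean each table before writing it (as the question does), `tabula.read_pdf` with `pages="all"` may be a better fit than `convert_into`, because it hands back the tables as DataFrames. The following is a rough sketch, assuming a tabula-py version in which `read_pdf` returns a list of DataFrames. The helper names `trim_at_marker`, `combine_tables` and `pdf_to_csv` are hypothetical, and the `'Sta'` marker and column index 5 are taken from the question. One likely source of the out-of-range error is `.index.tolist()[0]` on a page that has no `'Sta'` row, so the sketch only trims when the marker is actually present.

```python
import pandas as pd


def trim_at_marker(df, marker="Sta"):
    """Return the rows above the first row whose first column equals `marker`."""
    mask = (df.iloc[:, 0] == marker).to_numpy()
    # argmax is the position of the first True; with no match, keep all rows
    return df.iloc[: mask.argmax()] if mask.any() else df


def combine_tables(tables, column_names, drop_col=5):
    """Clean each extracted table and stack them into one DataFrame."""
    frames = []
    for number, table in enumerate(tables, start=1):
        table = trim_at_marker(table)
        if drop_col in table.columns:
            table = table.drop(columns=drop_col)
        if table.shape[1] != len(column_names):
            print(f"Skipping table {number}: unexpected column count {table.shape[1]}")
            continue
        table = table.copy()
        table.columns = column_names
        frames.append(table)
    if not frames:
        return pd.DataFrame(columns=column_names)
    return pd.concat(frames, ignore_index=True)


def pdf_to_csv(pdf_path, csv_path, column_names):
    """Read every page with tabula and write the combined table to CSV."""
    import tabula  # imported lazily; tabula-py also needs a Java runtime

    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True,
                             pandas_options={"header": None})
    # float_format writes every float column with two decimals in one pass
    combine_tables(tables, column_names).to_csv(csv_path, index=False, float_format="%.2f")
```

The `area` and `columns` arguments from the question can be passed to `read_pdf` in the same way. Since page 1 uses a different `area`, one option is to read it separately with `pages=1` and put that DataFrame at the front of the list given to `combine_tables`. Note that `float_format` applies to all float columns, not only the two weight columns the question formats.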