tabula error in Python related to dependencies (Colab and local)



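I am working on extracting data from many PDF documents in Python and testing in Colab. A solution that works in Colab would be great, but if that is not possible, a local one is fine. Each page has a lot of interesting entries, so I chose tabula.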

The code works well for most files but crashes on others...

Can I import the missing .jar etc. in Colab, or if not, how do I install it locally so the code runs?

Thanks in advance!

Got stderr: Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 17 fonts
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
... (multiple lines)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-987da78e7e88> in <module>()
2 regions = []
3 for i in range(0,len(regions_raw)):
----> 4     regions.append(regions_raw[i]['data'][0][0]['text'])
5 
IndexError: list index out of range

Code (prints only one region; mostly taken from https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754):

import tabula as tb
from tabula import read_pdf
import PyPDF2  # just for the page count
from PyPDF2 import PdfFileReader

box = [2, 0, 4, 13]  # tabula area: [top, left, bottom, right]
fc = 28.28           # scale factor applied to the box coordinates
for i in range(0, len(box)):
    box[i] *= fc

for filename in files:
    pdftemp = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdftemp)
    pagestmp = pdfReader.getNumPages()
    pages = [i + 3 for i in range(pagestmp - 2)]  # leave out first 2 pages
    regions_raw = tb.read_pdf(filename, pages=pages, area=[box], output_format="json")
    regions = []
    for i in range(0, len(regions_raw)):
        regions.append(regions_raw[i]['data'][0][0]['text'])
    print(regions)

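Oh, I figured it out. The code works; in some files the data just starts one page later (on page 4). The empty "data" entry for that page is what crashes and causes the error.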
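In case it helps someone else, here is a minimal sketch of how the loop could skip such empty entries instead of crashing. The helper name `first_cell_texts` and the sample data are made up for illustration; the only assumption is that each region has the same `['data'][row][cell]['text']` layout that the code above indexes.

```python
def first_cell_texts(regions_raw):
    """Collect the text of the first cell of each extracted region.

    Regions whose "data" list (or its first row) is empty are skipped
    instead of raising IndexError.
    """
    texts = []
    for region in regions_raw:
        rows = region.get("data") or []
        if rows and rows[0]:
            texts.append(rows[0][0].get("text", ""))
    return texts


# Hand-made sample mimicking the structure indexed in the question
# (hypothetical values, not real tabula output).
sample = [
    {"data": []},                        # page where the area holds no text
    {"data": [[{"text": "Entry A"}]]},
    {"data": [[]]},                      # a row without cells
    {"data": [[{"text": "Entry B"}]]},
]
print(first_cell_texts(sample))
```

With this, `regions = first_cell_texts(regions_raw)` would replace the inner loop, and pages without data in the selected area are simply left out of the result.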
