Missing data when reading a PDF file with Tabula and Python

I have a PDF that contains several text blocks and tables, and one row contains the following:

PDF content :
Id: 5647484848 Name Alex J

I am using tabula-py to parse the content, but the result is missing data: as you can see below, the first character or digit of each value is dropped.

My original PDF actually has a lot of text and tables. I tried the same approach on other rows, and there I get the correct result.

Wrong Result :
['', '', 'Id:', '', '647484848', 'Name', '', 'lex J', '', '', '']
Should be :
['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']

Sample:

# to get the exact row to find the name & index [7] is for Name
if len(row) == 11:
    if "Name" in row:
        print(row[7])
        return Student(studentname=row[7])
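
As a side note on the lookup above: a fixed index such as row[7] only works while the number of empty cells stays the same. A minimal sketch of a position-independent lookup, assuming the value is the first non-empty cell after the "Name" label (name_from_row is a hypothetical helper, not part of tabula-py):

```python
def name_from_row(row):
    """Return the first non-empty cell after the "Name" label, or None.

    Hypothetical helper: assumes the label and its value sit in the same
    row, with only empty cells between them.
    """
    if "Name" not in row:
        return None
    start = row.index("Name") + 1
    for cell in row[start:]:
        if cell.strip():
            return cell.strip()
    return None


row = ['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']
print(name_from_row(row))
```

This does not recover characters that the extraction itself dropped; it only removes the dependence on the exact cell position.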

In tabula, I read the tables with the following settings:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                     password=secure_password, lattice=True)

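The row is plain text, with no images or anything else. I don't know why extraction fails for this particular row. I have applied similar logic to other rows, where I get the correct result. Please advise.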

Solved by changing the extraction mode in tabula-py from lattice=True to lattice=False:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                     password=secure_password, lattice=False)
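
For context, the tabula-py documentation describes lattice mode as meant for tables whose cells are separated by ruling lines, so it is plausible (though not confirmed here) that the detected cell border clipped the first character in this row. To check what a given mode actually returns, it can help to flatten the JSON output into plain rows. The sketch below assumes the JSON layout is a list of tables, each with a "data" list of rows whose cells carry a "text" key; table_rows and read_rows are hypothetical helper names.

```python
def table_rows(tables):
    """Flatten tabula-style JSON tables into lists of cell strings.

    Assumes each table is a dict with a "data" list of rows, and each
    row is a list of cell dicts that carry a "text" key.
    """
    rows = []
    for table in tables:
        for row in table.get("data", []):
            rows.append([cell.get("text", "") for cell in row])
    return rows


def read_rows(pdf, password, lattice):
    """Read every page of the PDF and return its rows as plain strings."""
    # Imported lazily so table_rows can be used without tabula-py or Java.
    import tabula
    tables = tabula.read_pdf(pdf, output_format="json", pages="all",
                             password=password, lattice=lattice)
    return table_rows(tables)


# Hand-built sample in the assumed JSON layout, for illustration only.
sample = [{"data": [[{"text": "Id:"}, {"text": "5647484848"},
                     {"text": "Name"}, {"text": "Alex J"}]]}]
print(table_rows(sample))
```

Calling read_rows(pdf, secure_password, lattice=True) and read_rows(pdf, secure_password, lattice=False) on the same file and printing the rows that contain "Name" shows directly which mode keeps the full values.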
