How to extract table data from a PDF



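My goal is to process a .pdf file into memory. The problem is that the output ignores the table, so all the strings end up concatenated.

Library used: https://github.com/ledongthuc/pdf

Code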

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true
	content, err := readPdf("accountnumberJul2022.pdf") // read a local PDF file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
}

// readPdf returns the plain text of the whole document as one string.
func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	// Defer the close only after Open has succeeded.
	defer f.Close()

	b, err := r.GetPlainText()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(b); err != nil {
		return "", err
	}
	return buf.String(), nil
}

PDF file: https://drive.google.com/file/d/14RFll7pZ8_J8ua-NDrw31QHe-4N16IJL/view?usp=sharing

Output

DATEDESCRIPTIONBRENTRYBALANCE01/07Beginning Balance1,000.0002/07TRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSON1 DB999.0003/07TRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON2 DB997.0004/07TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON3 DB994.0005/07TRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXX56LAWSON4 DB990.0006/07TRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSON5 DB985.0007/07TRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSON6 DB979.0008/07TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON7 DB972.0009/07TRSFE-BANKINGDB0907/DBXOA/SB24313/Q0321XXXXX56LAWSON8 DB964.00Continued on next pageDATEDESCRIPTIONBRENTRYBALANCE10/07TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9 DB955.0011/07TRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSON10 DB945.0012/07TRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSON11 DB934.0013/07TRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSON12 DB922.0014/07TRSFE-BANKINGDB1407/DBXOA/SB24313/Q0321XXXXX56LAWSON13 DB909.0015/07TRSFE-BANKINGDB1507/DBXOA/SB24313/Q0321XXXXX56LAWSON14 DB895.0016/07INTEREST1517/07INTERESTTAX1909.00

What I tried:

  1. I also tried the example that reads text grouped by rows, changing fmt.Println(word.S) to fmt.Print(word.S).

However, the output is even less readable:

>>>> row:  0
ATEDESCRIPTIONBRENTRYBALANCEBe00.ginning Balance1,000.00469NOSWTRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSONAL659X99.00XXXXXTRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON1230997.00Q/313TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON42BS994.00/AOXBTRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXD56LAWSOND/70990.0090BDGTRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSONNIKN985.00AB-EFTRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSONSRT0979.000.279TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON701/70080009/0770270/70/0770/6003/070/507/4/03 DBBD 2B4 DB5D 1 DB68 DB DBBD 7oontinued Cn next page>>>> row:  0
TATEDESCRIPTIONBRENTRYBALANCE00.90TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9XAT955.00TSERETRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSONTNITS945.00ERETNTRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSONI00.5934.0098NO/DSRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSONWAL65922.00XXXXXTRSFE-BANKINGDB1407DBXOA/SB24313/Q0321XXXXX56LAWSON1230Q909.00/3134TRSFE-BANKINGDB1507/DBXOA/SB20/5110/0770/410707/60170//7137712/107/1141B3 BDBDDBD 1 0112 BD 11D9B 511

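PDF files are not designed to be machine-readable, and their internal structure can vary from one file to the next. So I doubt there will ever be a "solid solution" for parsing arbitrary PDF files. A PDF is not necessarily "structured" the way the original spreadsheet file it may have come from is. It is closer to vector graphics: it contains only positions and commands for drawing characters at the right places, rather than the text itself.

In your case, your particular PDF file seems to be well structured. Extracting the content with qpdf shows: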

# part of the pdf content extracted, comments (#) added by me.
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj  # D
10.5842743 0 Td <0024> Tj   # A
8.6870575 0 Td <0037> Tj    # T
8.9526215 0 Td <0028> Tj    # E
ET
Q
Q
q
147.75 87.296265 149.25 23.148926 re
W* n
q
.75 0 0 .75 152.25 92.546265 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0028> Tj     # E
9.7756042 0 Td <0036> Tj      # S
9.7756042 0 Td <0026> Tj      # C
10.5842743 0 Td <0035> Tj     # R
10.5842743 0 Td <002C> Tj     # I
4.0719757 0 Td <0033> Tj      # P
9.7756042 0 Td <0037> Tj      # T
8.9526215 0 Td <002C> Tj      # I
4.0719757 0 Td <0032> Tj      # O
11.4001007 0 Td <0031> Tj     # N
ET
# some part skipped......
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0037> Tj    # T 
8.9526215 0 Td <0035> Tj      # R
10.5842743 0 Td <0036> Tj     # S
9.7756042 0 Td <0029> Tj      # F
ET
Q
q
.75 0 0 .75 152.25 152.993042 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0028> Tj    # E
9.7756042 0 Td <0010> Tj      # -
4.8806458 0 Td <0025> Tj      # B
9.7756042 0 Td <0024> Tj      # A
9.7756042 0 Td <0031> Tj      # N
10.5842743 0 Td <002E> Tj     # K
9.7756042 0 Td <002C> Tj      # I
4.0719757 0 Td <0031> Tj      # N
10.5842743 0 Td <002A> Tj     # G
ET
Q
q
.75 0 0 .75 152.25 165.641968 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0025> Tj     # B
ET

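BT = begin text
ET = end text

Writing a program similar to the library you are using (https://github.com/ledongthuc/pdf), or modifying the library directly so that everything between a pair of BT and ET is parsed as a single piece of text, should be trivial. The hardest part is recovering the column and row information of the spreadsheet (i.e., which text belongs to which field), because in the eyes of a PDF reader the lines of the spreadsheet are just a bunch of arbitrary lines, and sometimes arbitrary rectangles.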
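As a rough illustration of the BT/ET idea, the sketch below collects every `<hex> Tj` operand between a BT and the next ET into one string. It is not how the library works internally. The function name `textObjects` is made up for this example, only the `Tj` operator is handled, and the glyph decoding (glyph ID + 0x1D = ASCII code) is an assumption read off the annotated dump above; a real decoder would use the font's ToUnicode mapping.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// glyphOffset is an ASSUMPTION taken from the annotated dump
// (<0027> is 'D', <0024> is 'A'): glyph ID + 0x1D = ASCII code.
// It is specific to this file and not a general PDF rule.
const glyphOffset = 0x1D

// textObjects (hypothetical helper) returns one string per BT ... ET
// pair in an already decompressed content stream. It only understands
// "<hex> Tj" and ignores every other operator.
func textObjects(stream string) []string {
	var out []string
	var cur strings.Builder
	inText := false
	tokens := strings.Fields(stream)
	for i, tok := range tokens {
		switch tok {
		case "BT":
			inText = true
			cur.Reset()
		case "ET":
			if inText {
				out = append(out, cur.String())
			}
			inText = false
		case "Tj":
			if !inText || i == 0 {
				continue
			}
			operand := tokens[i-1]
			if !strings.HasPrefix(operand, "<") || !strings.HasSuffix(operand, ">") {
				continue
			}
			id, err := strconv.ParseUint(strings.Trim(operand, "<>"), 16, 16)
			if err != nil {
				continue
			}
			cur.WriteRune(rune(id + glyphOffset))
		}
	}
	return out
}

func main() {
	// Two text objects copied from the dump above, comments removed.
	const sample = `
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj
10.5842743 0 Td <0024> Tj
8.6870575 0 Td <0037> Tj
8.9526215 0 Td <0028> Tj
ET
Q
q
BT
/F4 14.666667 Tf
0 -13.2773438 Td <0027> Tj
10.5842743 0 Td <0025> Tj
ET
`
	fmt.Printf("%q\n", textObjects(sample)) // prints ["DATE" "DB"]
}
```

Each returned string corresponds to one text object, so "DATE" and "DB" come out as separate pieces instead of being fused with their neighbours.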

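Here is a demo program I wrote. It first finds all rectangles, then puts every piece of text into the rectangle that contains it, then sorts and concatenates the text within each field to form the final result.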

package main

import (
	"fmt"
	"math"
	"sort"

	"github.com/ledongthuc/pdf"
)

// FieldRect was not shown in the original snippet; this definition is
// reconstructed from how its fields are used below.
type FieldRect struct {
	rect       pdf.Rect
	texts      []pdf.Text
	lastPos    *pdf.Point
	resultText string
}

func main() {
	readPdf("accountnumberJul2022.pdf")
}

func readPdf(path string) {
	f, r, err := pdf.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// extract all rectangles
	var fieldRects []FieldRect
	p := r.Page(1)
	c := p.Content()
	// font := p.Font(p.Fonts()[0])
	// fmt.Printf("font.Widths(): %v\n", font.Widths())
	for _, rect := range c.Rect {
		fieldRects = append(fieldRects, FieldRect{rect: rect})
	}
	// put text (glyphs) into their corresponding rectangles
	for _, t := range c.Text {
		for i := range fieldRects {
			fr := &fieldRects[i]
			if fr.rect.Min.X < t.X && fr.rect.Min.Y < t.Y &&
				fr.rect.Max.X > t.X && fr.rect.Max.Y > t.Y {
				fr.texts = append(fr.texts, t)
			}
		}
	}
	// these values can also be derived from font size to gain
	// even more robustness
	const newlineToleration = 2
	// unfortunately the pdf you sent does not have proper font
	// width information, so this is the best we can get without
	// inferring width information from the glyph shape itself.
	const spaceToleration = 11
	// sort text (glyphs) by position within rectangles, then concat
	for i := range fieldRects {
		fr := &fieldRects[i]
		sort.Slice(fr.texts, func(i, j int) bool {
			deltaY := fr.texts[i].Y - fr.texts[j].Y
			if math.Abs(deltaY) < newlineToleration { // tolerate some vertical deviation
				return fr.texts[i].X < fr.texts[j].X // on the same line
			}
			return deltaY > 0 // not on the same line: larger Y comes first
		})
		for _, t := range fr.texts {
			if fr.lastPos != nil {
				if fr.lastPos.Y-t.Y > newlineToleration { // new line
					fr.resultText += "\n"
				}
				if t.X-fr.lastPos.X > spaceToleration { // space
					fr.resultText += " "
				}
			}
			fr.resultText += t.S
			fr.lastPos = &pdf.Point{X: t.X, Y: t.Y}
		}
		if fr.resultText == "" {
			continue
		}
		fmt.Printf("====== pos: %v, %v; text: \n%s\n", fr.rect.Min, fr.rect.Max, fr.resultText)
	}
}

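Because the PDF file you sent lacks font width information, there is no simple way to detect spaces reliably. The program produces a readable, but not great, result: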

====== pos: {0 0}, {794 1123}; text: 
DATE DESCRIPTION BR ENTRY BALANCE
01/07 Beginning Balance 1,000.00
02/07 TRSF 0207/DBXO 1 DB 999.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
03/07 TRSF 0307/DBXO 2 DB 997.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
04/07 TRSF 0407/DBXO 3 DB 994.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
05/07 TRSF 0507/DBXO 4 DB 990.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
06/07 TRSF 0607/DBXO 5 DB 985.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
07/07 TRSF 0707/DBXO 6 DB 979.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
08/07 TRSF 0807/DBXO 7 DB 972.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
09/07 TRSF 0907/DBXO 8 DB 964.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
Continued on next page
====== pos: {372.75 87.296265}, {447 110.44519100000001}; text: 
Continue
====== pos: {447.75 87.296265}, {522 110.44519100000001}; text: 
d on next page
====== pos: {147.75 111.19519}, {297 134.34411599999999}; text: 
X56
LAWSON
====== pos: {147.75 135.094116}, {222 183.540893}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 135.094116}, {297 208.83874500000002}; text: 
X56
LAWSON
0907/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 209.58875}, {222 258.035527}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 209.58875}, {297 283.33337900000004}; text: 
X56
LAWSON
0807/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 284.08337}, {222 332.530147}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 284.08337}, {297 357.827999}; text: 
X56
LAWSON
0707/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 358.578}, {222 407.024777}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 358.578}, {297 432.322629}; text: 
X56
LAWSON
0607/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 433.07263}, {222 481.519407}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 433.07263}, {297 506.81725900000004}; text: 
X56
LAWSON
0507/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 507.56726}, {222 556.0140369999999}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 507.56726}, {297 581.311889}; text: 
X56
LAWSON
0407/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 582.06189}, {222 630.508667}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 582.06189}, {297 655.806519}; text: 
X56
LAWSON
0307/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 656.55652}, {222 705.003297}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 656.55652}, {297 730.301149}; text: 
nce
0207/DBXO
A/SB24313/
Q0321XXXX
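On the space-detection problem: the comments in the demo suggest that the two tolerances could be derived from the font size rather than hard-coded. The following is only a sketch of that idea. The `Glyph` type, the function `joinGlyphs`, and both ratios are assumptions made for illustration, not values taken from the library or from this PDF, and the glyphs are assumed to be already sorted in reading order.

```go
package main

import (
	"fmt"
	"strings"
)

// Glyph is a hypothetical stand-in for one positioned character.
type Glyph struct {
	X, Y     float64
	FontSize float64
	S        string
}

// ASSUMED ratios: a horizontal jump wider than one font size is
// treated as a space, and a vertical drop larger than a fifth of the
// font size as a new line. Tune both for the document at hand.
const (
	spaceRatio   = 1.0
	newlineRatio = 0.2
)

// joinGlyphs concatenates glyphs that are already in reading order,
// inserting "\n" or " " based on thresholds scaled by the font size.
func joinGlyphs(glyphs []Glyph) string {
	var sb strings.Builder
	for i, g := range glyphs {
		if i > 0 {
			prev := glyphs[i-1]
			switch {
			case prev.Y-g.Y > newlineRatio*g.FontSize:
				sb.WriteByte('\n')
			case g.X-prev.X > spaceRatio*g.FontSize:
				sb.WriteByte(' ')
			}
		}
		sb.WriteString(g.S)
	}
	return sb.String()
}

func main() {
	glyphs := []Glyph{
		{X: 0, Y: 100, FontSize: 11, S: "D"},
		{X: 8, Y: 100, FontSize: 11, S: "B"},
		{X: 30, Y: 100, FontSize: 11, S: "9"},
		{X: 0, Y: 80, FontSize: 11, S: "X"},
	}
	fmt.Printf("%q\n", joinGlyphs(glyphs)) // prints "DB 9\nX"
}
```

Scaling by font size keeps the thresholds meaningful if the same layout is rendered at a different size, but it still cannot replace real glyph width information.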

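Another reason it does not work well is that the rectangles extracted from the document do not match the cells as they appear in a PDF reader.

In fact, I believe this would work almost perfectly if you could find a way to locate all the cells and divide the page into correctly positioned and sized cells, either manually or automatically. (Perhaps you could do this with the trick of treating everything between a pair of BT and ET as a single piece of text, instead of relying on the relative positions of glyphs.)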
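A minimal sketch of the manual variant, assuming you have already worked out the column and row boundaries by hand: each glyph is assigned to a (row, column) cell with a binary search over the edges. The `Glyph` type, the helper names, and the edge values in `main` are hypothetical; the sketch also assumes the glyphs arrive in reading order and that row edges are listed in ascending Y.

```go
package main

import (
	"fmt"
	"sort"
)

// Glyph is a hypothetical stand-in for a positioned piece of text.
type Glyph struct {
	X, Y float64
	S    string
}

// cellIndex returns k such that edges[k] <= v < edges[k+1], or -1 if v
// falls outside the grid. edges must be sorted in ascending order.
func cellIndex(edges []float64, v float64) int {
	n := sort.Search(len(edges), func(i int) bool { return edges[i] > v })
	if n == 0 || n == len(edges) {
		return -1
	}
	return n - 1
}

// assign concatenates glyph strings per cell, keyed by {row, col}.
// Glyphs outside the grid are dropped.
func assign(glyphs []Glyph, colEdges, rowEdges []float64) map[[2]int]string {
	cells := make(map[[2]int]string)
	for _, g := range glyphs {
		row, col := cellIndex(rowEdges, g.Y), cellIndex(colEdges, g.X)
		if row < 0 || col < 0 {
			continue
		}
		cells[[2]int{row, col}] += g.S
	}
	return cells
}

func main() {
	// Hand-picked, hypothetical edges; real ones must be measured
	// on the actual page.
	colEdges := []float64{147.75, 222.75, 297.75}
	rowEdges := []float64{87, 111, 135}
	glyphs := []Glyph{
		{150, 100, "D"}, {160, 100, "B"}, // row 0, col 0
		{230, 100, "9"}, // row 0, col 1
		{150, 120, "X"}, // row 1, col 0
		{500, 100, "?"}, // outside the grid, ignored
	}
	cells := assign(glyphs, colEdges, rowEdges)
	for row := 0; row < len(rowEdges)-1; row++ {
		for col := 0; col < len(colEdges)-1; col++ {
			fmt.Printf("row %d col %d: %q\n", row, col, cells[[2]int{row, col}])
		}
	}
}
```

Working from shared edges rather than individual rectangles avoids the mismatch between extracted rectangles and visible cells, at the cost of having to know the grid in advance.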

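But keep in mind that even if you pull this off, it will still only work for this specific format and (for lack of a better word) "flavor" of PDF created by this particular software, and it is unlikely to work well on anything else.

There are also commercial solutions available, such as the framework behind https://www.ilovepdf.com/pdf_to_excel, which I found works very well on your particular file. These solutions tend to be more powerful and reliable, but they do cost money. Some of them can be used online, so finding a way to use their API might be a viable alternative. (If their terms of service allow you to do so, that is.)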
