Is there a way to extract tidy tables from a PDF using R?



I need automated code to extract tables from PDF files in R.

So I searched online and found the tabulizer package.

I used:

extract_tables(f2, pages = 25, guess = TRUE, encoding = "UTF-8", method = "stream")  # f2 is the PDF file name

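I tried every method type, but the results are not tidy.

Some columns are mixed together and there is a lot of blank space, as you can see in the image.

I could fix the data by hand, but the goal is to automate this, so I need a general method. Not every PDF file is well organized: some tables are very tidy, with every related row lining up perfectly, but others are not. As you can see in my result image, in column 4 several numbers end up mixed into the same column, while in the other columns the numbers line up one to one. What I want is for the columns to be arranged automatically the way they are in the PDF table.

Is there any package or method that makes the extracted table tidy?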

Result of my code

Table in the PDF

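With the following code, I was able to extract the numbers in the table. First, I converted the image to a PDF file. Then I converted the PDF file to a Word file. Finally, I extracted the table from the Word file. This solution only works on Windows.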

library(RDCOMClient)
library(magick)

# Backslashes in Windows paths must be doubled inside R strings.
path_PDF <- "D:\\image_Stackoverflow79.pdf"
path_PNG <- "D:\\Dropbox\\Reponses_Stackoverflow\\image_Stackoverflow79.png"
path_Word <- "D:\\image_Stackoverflow79.docx"

# Crop the screenshot to the table area and draw it into a PDF.
pdf(path_PDF, height = 8, width = 6)
im <- image_read(path_PNG)
im <- image_crop(im, geometry = geometry_area(width = 510, height = 310, x_off = 100, y_off = 110))
plot(im)
dev.off()

# Open the PDF in Word and save the converted document as .docx.
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)

# Read the first table cell by cell; cells that raise an error stay NA.
nb_Row <- doc$tables(1)$Rows()$Count()
nb_Col <- doc$tables(1)$Columns()$Count()
mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)
for (i in 1:nb_Row) {
  for (j in 1:nb_Col) {
    mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(), error = function(e) NA)
  }
}
mat_Temp
      [,1]   [,2]        [,3]         [,4]         [,5]        [,6]        [,7]        [,8]
 [1,] "\r\a" "\r\a"      "\r\a"       "\r\a"       "\r\a"      "\r\a"      "\r\a"      "\r\a"
 [2,] "\r\a" "0.46\r\a"  "0.46\r\a"   "0.46\r\a"   "0.46\r\a"  "0.46\r\a"  "0.46\r\a"  "\r\a"
 [3,] "\r\a" "1.00\r\a"  "0.00\r\a"   "0.98\r\a"   "0.03\r\a"  "0.95\r\a"  "0.85\r\a"  NA
 [4,] "\r\a" "0.025\r\a" "0.025\r\a"  "0.025\r\a"  "0.025\r\a" "0.025\r\a" "0.025\r\a" NA
 [5,] "\r\a" "0.005\r\a" "0.005\r\a"  "0.005\r\a"  "0.005\r\a" "0.005\r\a" "0.005\r\a" NA
 [6,] "\r\a" "1.49\r\a"  "0.49\r\a"   "1.47\r\a"   "0.52\r\a"  "1.44\r\a"  "1.34\r\a"  "\r\a"
 [7,] "\r\a" "0.002\r\a" "0.002\r\a"  "0.002\r\a"  "0.002\r\a" "0.002\r\a" "0.002\r\a" "\r\a"
 [8,] "\r\a" "1.492\r\a" "0.492\r\a"  "1472\r\a"   "0.522\r\a" "1.442\r\a" "1.342\r\a" "\r\a"
 [9,] "\r\a" "1.59\r\a"  "\r\a"       "1.22\r\a"   "\r\a"      "\r\a"      "\r\a"      "\r\a"
[10,] "\r\a" "1.493\r\a" "0.493\r\a"  "1473\r\a"   "0.523\r\a" "1.443\r\a" "1.343\r\a" "\r\a"
[11,] "\r\a" "0.107\r\a" "o. 108\r\a" "o. 105\r\a" "0.108\r\a" "0.106\r\a" "0.104\r\a" "\r\a"
[12,] "\r\a" "\r\a"      "\r\a"       NA           NA          NA          NA          NA

With this approach, the numbers appear to end up in the correct columns.
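The cell text that Word returns is still raw: each cell ends with Word's end-of-cell marker (a carriage return followed by a bell character), and a few values in the output above carry recognition errors such as a letter "o" in place of a zero. As a minimal sketch of one possible clean-up step, the hypothetical helper `clean_cells` below (not part of the answer's code) strips the markers and converts the cells to numbers. It assumes the input is a character matrix like `mat_Temp`, and the "o" to "0" replacement is a guess based only on the values shown here.

```r
# Hypothetical helper: turn the raw cell text read from Word into numbers.
clean_cells <- function(mat) {
  txt <- gsub("[\r\a]", "", mat)        # drop Word's end-of-cell marker
  txt <- gsub("[[:space:]]", "", txt)   # "o. 108" -> "o.108"
  txt <- sub("^[oO]\\.", "0.", txt)     # assumed OCR slip: letter o before the decimal point
  txt[!is.na(txt) & txt == ""] <- NA    # empty cells become NA
  out <- suppressWarnings(as.numeric(txt))
  dim(out) <- dim(mat)                  # as.numeric() drops the matrix shape
  out
}

# Small made-up sample in the same shape as the cells above.
raw <- matrix(c("\r\a", "0.107\r\a", "o. 108\r\a", NA), nrow = 2)
cleaned <- clean_cells(raw)
cleaned
```

Values that appear to have lost their decimal point during recognition (for example 1472 next to 1.492 and 0.492) are converted as they are, so they would still need a separate plausibility check.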
