I need code that automatically extracts tables from PDF files in R.
I searched online and found the tabulizer package.
I used:
extract_tables(f2, pages = 25, guess = TRUE, encoding = "UTF-8", method = "stream")  # f2 is the PDF file name
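I tried every method type, but the results are not tidy.
Some columns are merged together and there are many blank cells, as you can see in the image.
I could fix the data by hand, but the goal is to automate this, so I need a general approach. Not every PDF file is well organized. Some tables are very tidy, with every related row lining up perfectly, but others are not. As you can see in my result image, the numbers in column 4 are mixed together in a single column. In the other columns the numbers match up one to one. What I want is for the columns to be arranged automatically the way they appear in the PDF table.
Is there a package or method that can make the extracted tables tidy?
My code result
The table in the PDF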
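With the following code, I was able to extract the numbers in the table. First, I converted the image to a PDF file. Next, I converted the PDF file to a Word file. Finally, I extracted the table from the Word file. This solution works only on Windows.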
library(RDCOMClient)
library(magick)

# Backslashes in Windows paths must be escaped inside R strings
path_PDF <- "D:\\image_Stackoverflow79.pdf"
path_PNG <- "D:\\Dropbox\\Reponses_Stackoverflow\\image_Stackoverflow79.png"
path_Word <- "D:\\image_Stackoverflow79.docx"

# Step 1: crop the image to the table area and write it to a PDF file
pdf(path_PDF, height = 8, width = 6)
im <- image_read(path_PNG)
im <- image_crop(im, geometry = geometry_area(width = 510, height = 310, x_off = 100, y_off = 110))
plot(im)
dev.off()

# Step 2: open the PDF in Word (Word converts it) and save it as .docx
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)

# Step 3: read the first table of the Word document cell by cell
nb_Row <- doc$tables(1)$Rows()$Count()
nb_Col <- doc$tables(1)$Columns()$Count()
mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

for (i in 1:nb_Row) {
  for (j in 1:nb_Col) {
    # Cells that do not exist (e.g. merged cells) raise an error; store NA instead
    mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(), error = function(e) NA)
  }
}

mat_Temp
      [,1]   [,2]        [,3]         [,4]         [,5]        [,6]        [,7]        [,8]
 [1,] "\r\a" "\r\a"      "\r\a"       "\r\a"       "\r\a"      "\r\a"      "\r\a"      "\r\a"
 [2,] "\r\a" "0.46\r\a"  "0.46\r\a"   "0.46\r\a"   "0.46\r\a"  "0.46\r\a"  "0.46\r\a"  "\r\a"
 [3,] "\r\a" "1.00\r\a"  "0.00\r\a"   "0.98\r\a"   "0.03\r\a"  "0.95\r\a"  "0.85\r\a"  NA
 [4,] "\r\a" "0.025\r\a" "0.025\r\a"  "0.025\r\a"  "0.025\r\a" "0.025\r\a" "0.025\r\a" NA
 [5,] "\r\a" "0.005\r\a" "0.005\r\a"  "0.005\r\a"  "0.005\r\a" "0.005\r\a" "0.005\r\a" NA
 [6,] "\r\a" "1.49\r\a"  "0.49\r\a"   "1.47\r\a"   "0.52\r\a"  "1.44\r\a"  "1.34\r\a"  "\r\a"
 [7,] "\r\a" "0.002\r\a" "0.002\r\a"  "0.002\r\a"  "0.002\r\a" "0.002\r\a" "0.002\r\a" "\r\a"
 [8,] "\r\a" "1.492\r\a" "0.492\r\a"  "1472\r\a"   "0.522\r\a" "1.442\r\a" "1.342\r\a" "\r\a"
 [9,] "\r\a" "1.59\r\a"  "\r\a"       "1.22\r\a"   "\r\a"      "\r\a"      "\r\a"      "\r\a"
[10,] "\r\a" "1.493\r\a" "0.493\r\a"  "1473\r\a"   "0.523\r\a" "1.443\r\a" "1.343\r\a" "\r\a"
[11,] "\r\a" "0.107\r\a" "o. 108\r\a" "o. 105\r\a" "0.108\r\a" "0.106\r\a" "0.104\r\a" "\r\a"
[12,] "\r\a" "\r\a"      "\r\a"       NA           NA          NA          NA          NA
With this approach, the numbers appear to land in the correct columns.
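
The extracted cells still need some cleanup before they can be used as numbers: Word ends each cell's text with a carriage return and a bell character (`\r\a` in R), and empty cells contain only that marker. The following is a minimal sketch of one possible cleanup step, not part of the original answer. The matrix `mat_sample` and the helper names `clean_cells` and `to_numeric` are hypothetical and only illustrate the idea on a few sample cells.

```r
# Hypothetical sample mimicking a few cells of mat_Temp;
# each cell read from Word ends with "\r\a".
mat_sample <- matrix(
  c("\r\a", "0.46\r\a", "1.00\r\a",
    "0.98\r\a", "o. 108\r\a", NA),
  nrow = 2, byrow = TRUE
)

# Remove the end-of-cell marker and treat empty cells as missing.
clean_cells <- function(m) {
  out <- gsub("\r\a", "", m, fixed = TRUE)  # gsub keeps the matrix dimensions
  out[!is.na(out) & out == ""] <- NA
  out
}

# Convert to numeric; cells that are not valid numbers
# (for example misrecognized digits) become NA.
to_numeric <- function(m) {
  num <- suppressWarnings(as.numeric(m))
  matrix(num, nrow = nrow(m), ncol = ncol(m))
}

mat_clean <- clean_cells(mat_sample)
mat_num <- to_numeric(mat_clean)
mat_num
```

Cells such as "o. 108", where the conversion misread a digit, come out as NA in this sketch, so they can be located with `which(is.na(mat_num) & !is.na(mat_clean), arr.ind = TRUE)` and checked against the PDF.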