Convert scanned PDFs to searchable PDFs (in R)

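I am trying to convert a series of scanned PDFs into searchable PDFs using the tesseract and pdftools packages. I have completed the first two of the steps below. Now I need to write the result back as a searchable PDF.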

1. Read in the scanned PDF
2. Run OCR
3. Write back a searchable PDF

eg <- download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", "example.pdf", mode = "wb")
results <- tesseract::ocr_data("example.pdf", engine = "eng")

R> results
# A tibble: 406 x 3
word        confidence bbox             
<chr>            <dbl> <chr>            
1 PFU               96.9 228,181,404,249  
2 Business          96.2 459,180,847,249  
3 report            96.2 895,182,1145,259 
4 |                 52.5 3980,215,3984,222
5 No.068            91.0 4439,163,4754,237
6 New               96.0 493,503,1005,687 
7 customer's        94.6 1069,484,2231,683
8 development       96.5 2304,483,3714,732
9 di                90.4 767,763,1009,959 
10 ing               96.3 1754,773,1786,807
# ... with 396 more rows
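
Any attempt to place the recognized words back onto the page needs the bbox strings as numbers. The following is a minimal base-R sketch of that step, not part of the tesseract package: `split_bbox` is a hypothetical helper, and it assumes the bbox order is left, top, right, bottom in pixels, which is consistent with the rows shown above.

```r
# Hypothetical helper: split "x1,y1,x2,y2" bbox strings into four integer
# columns. Assumes the order is left, top, right, bottom.
split_bbox <- function(bbox) {
  parts <- strsplit(bbox, ",", fixed = TRUE)
  # Every bbox must have exactly four components
  stopifnot(all(lengths(parts) == 4L))
  coords <- do.call(rbind, lapply(parts, as.integer))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  as.data.frame(coords)
}

# Two rows copied from the output above
words <- data.frame(
  word = c("PFU", "Business"),
  bbox = c("228,181,404,249", "459,180,847,249"),
  stringsAsFactors = FALSE
)

# Append the numeric coordinates and derive each word's size in pixels
words <- cbind(words, split_bbox(words$bbox))
words$width  <- words$x2 - words$x1
words$height <- words$y2 - words$y1
```

With real output, the same call would be `split_bbox(results$bbox)`.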

Or is there another package or command-line tool that I can call from R to do this?

I had a similar need and wrote a simple R function that calls the OCRmyPDF command-line tool.

I am on Ubuntu, so I first installed OCRmyPDF there:

sudo apt install ocrmypdf

The OCRmyPDF documentation has instructions for installing it on other operating systems.

Then load the R function:

ocr_my_pdf <- function(path_read, ..., path_save = NULL){

  # Resolve the input path relative to the project root
  path_read <- here::here(path_read)
  if(is.null(path_save)){
    # Default output: the input name with an "_ocr" suffix
    path_save <- stringr::str_replace(path_read, '(?i)\\.pdf$', '_ocr.pdf')
  } else {
    path_save <- here::here(path_save)
  }

  # Single-quote every argument so paths containing spaces survive the shell
  sys_args <- c(
    glue::glue("'{unlist(list(...))}'"),
    glue::glue("'{path_read}'"),
    glue::glue("'{path_save}'"))
  system2('ocrmypdf', args = sys_args)

}

Then call the function on a test PDF:

ocr_my_pdf('/home/test.pdf')

Or pass any additional arguments you want:

ocr_my_pdf('test.pdf', '--deskew', '--clean', '--rotate-pages')

The OCRmyPDF documentation lists the available arguments.
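
Since the question is about a series of scanned PDFs, the same idea can be applied to a whole folder. The following is a minimal sketch under stated assumptions: `ocr_output_path` and `ocr_directory` are hypothetical helpers (not part of OCRmyPDF or any R package), `ocrmypdf` is assumed to be on the PATH, and arguments are quoted with base R's `shQuote` instead of glue.

```r
# Hypothetical helpers, not part of OCRmyPDF or any R package.

# Derive "<name>_ocr.pdf" from "<name>.pdf" (extension matched case-insensitively).
ocr_output_path <- function(path) {
  sub("\\.pdf$", "_ocr.pdf", path, ignore.case = TRUE)
}

# Run ocrmypdf on every PDF in `dir`, skipping outputs of a previous run.
# Returns the exit status of each call, named by input file.
ocr_directory <- function(dir, extra_args = character()) {
  inputs <- list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE,
                       full.names = TRUE)
  inputs <- inputs[!grepl("_ocr\\.pdf$", inputs, ignore.case = TRUE)]
  vapply(inputs, function(input) {
    # shQuote protects paths that contain spaces
    args <- c(extra_args, shQuote(input), shQuote(ocr_output_path(input)))
    system2("ocrmypdf", args = args)
  }, integer(1))
}

# Example call (assumes a folder named "scans" exists):
# status <- ocr_directory("scans", extra_args = "--deskew")
```

A non-zero entry in the returned vector means the corresponding call did not succeed, so failed files can be retried without reprocessing the rest.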

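Here is an approach based on the RDCOMClient R package. Basically, we convert the PDF to a Word document; during that conversion, Word applies its built-in OCR. Then, still using Word, we save the Word document as a searchable PDF.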

library(RDCOMClient)
path_PDF <- "C:/example.pdf"
path_Word <- "C:/example.docx"
# Download the scanned PDF to the path that Word will open
download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", path_PDF, mode = "wb")
################################################################
#### Step 1 : Convert PDF to word document with OCR of Word ####
################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
##########################################################
#### Step 2 : Convert word document to searchable pdf ####
##########################################################
path_PDF_Searchable <- "C:/example_searchable.pdf"
wordApp[["ActiveDocument"]]$SaveAs(path_PDF_Searchable, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp

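If you have the eCopy software installed on your computer (it is not free), you can use the following function to convert scanned PDF files into searchable PDF files: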

ecopy_Scanned_PDF_To_Numeric_PDF <- function(directory_Scanned_PDF, directory_Numeric_PDF)
{
  path_To_BatchConverter <- "C:/Program Files (x86)/Nuance/eCopy PDF Pro Office 6/BatchConverter.com"
  # The backslash must be doubled inside an R string literal
  args <- paste0("-I", directory_Scanned_PDF, "\\*.pdf -O", directory_Numeric_PDF, " -Tpdfs -Lfre -W -V1.5 -J -Ao")
  system2(path_To_BatchConverter, args = args)
}

I use this function at work and it works well.
