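I am trying to extract a few lines from a 300-line list built from a set of PDF files in a directory.
The content of all the PDF files is held in a single list of 300 lines. I now want to extract the lines that contain matching words.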
library(stringr)
library(pdftools)
library(tm)
library(tidyverse)
library(rex)
#Directory with multiple pdf files
files<- list.files(pattern='pdf$')
#Extract the content of every file into a list (one character vector of lines per file)
#Note: [[1]] keeps only the lines of each file's first page
result <- lapply(files, function(x) strsplit(pdf_text(x), "\n")[[1]])
#Flatten into a single character vector for ease of processing
mylist <- unlist(result)
#Collapse repeated whitespace in each line and trim both ends
mylist <- str_squish(mylist)
#Find the indices of the lines that match each string (e.g. "Table" for t)
t <- grep("Table", mylist)
t1 <- grep("T[0-9]", mylist)
f <- grep("Figure", mylist)
f1 <- grep("F[0-9]", mylist)
l <- grep("Listing", mylist)
l1 <- grep("L[0-9]", mylist)
s <- grep("Source", mylist)
# Output of t with indices where there is a match for string "Table"
> t
[1] 46 71 95 124 153 250 278
#How do I put the lines at these indices into a new list? Or should I go back to mylist, pass in the index numbers, and extract the lines from there? What is the best way to do it?
----------------------------
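When I run these lines of code (t, t1, f, f1, l, l1, s), I get the indices of the lines that contain the matching string.
Below is an image with the output, showing the lines that matched.
Now I just need to put those lines into another list. Please let me know how to do that.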
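It is hard to say without test data, and the code below is untested.
Put the patterns in a list and use lapply/grep with value = TRUE. This returns a list in which each member is a vector of the matching strings.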
search_list <- list("Table", "T[0-9]", "Figure", "F[0-9]", "Listing", "L[0-9]", "Source")
matches_list <- lapply(search_list, grep, x = mylist, value = TRUE)
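Since no test data was posted, here is a minimal sketch on made-up lines. The contents of mylist below are hypothetical stand-ins, not the asker's PDF text. It shows the two options raised in the question: subsetting mylist with the indices grep returns, or asking grep for the matching strings directly with value = TRUE. Naming the result list by pattern is an optional extra, not something the original code does.

```r
# Hypothetical stand-in for mylist; the real one would come from pdf_text()
mylist <- c(
  "Table 1 Summary of demographics",
  "Some unrelated text",
  "Figure 2 Mean change over time",
  "Listing 3 Adverse events",
  "Source: T14 programs"
)

# Option 1: reuse the indices returned by grep() to subset the vector
t <- grep("Table", mylist)
table_lines <- mylist[t]

# Option 2: have grep() return the matching strings instead of indices
table_lines2 <- grep("Table", mylist, value = TRUE)

# Same idea for several patterns at once; name each result after its pattern
search_list <- c("Table", "T[0-9]", "Figure", "F[0-9]", "Listing", "L[0-9]", "Source")
matches_list <- lapply(search_list, grep, x = mylist, value = TRUE)
names(matches_list) <- search_list
```

Both options give the same lines, so mylist[t] is fine if the indices are already computed. With the names set, matches_list[["Table"]] retrieves the lines for one pattern, and a pattern with no matches yields an empty character vector rather than an error.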