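I'm working on scraping text data from roughly 1,000 PDF files. I've managed to import them all into RStudio and used str_subset and str_extract_all to pull out the smaller attributes I need. The main goal of the project is to scrape the case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized across all of the individual documents. A reproducible example follows.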
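Can I use the two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I want to extract? If not, what approach could I take to extract the narrative data I need from each report?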
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is the expected output.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks for your help!
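A quick approach is to use gsub with regexes to remove everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything from INVESTIGATOR: onward ('INVESTIGATOR:.*'). What remains is the text between the two matches. Note that "INVESTIGATOR'S REPORT" inside the narrative is not affected, because the second pattern requires the colon.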
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
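If you would rather capture the narrative directly than delete what surrounds it, the same two markers can bound a capture group. Below is a minimal base R sketch, assuming each report is a single string. The helper name extract_narrative is hypothetical (it is not part of the answer above), and the pattern trims the whitespace on either side of the narrative, which the gsub version leaves in place.

```r
# Hypothetical helper: return the text between "CASE HISTORY" and
# "INVESTIGATOR:" for each document, or NA when either marker is missing.
extract_narrative <- function(x) {
  x <- as.character(x)  # accepts a character vector or a list of strings
  # (?s) lets "." match newlines under perl = TRUE; ".*?" is lazy, so the
  # capture stops at the first "INVESTIGATOR:" after "CASE HISTORY".
  pattern <- "(?s)CASE HISTORY\\s*(.*?)\\s*INVESTIGATOR:"
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full_match, group_1); an empty vector means no match.
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

# Small made-up documents, for illustration only.
docs <- list(
  "HEAD\nCASE HISTORY\n First line\n EXAMINER / INVESTIGATOR'S REPORT\nsecond line.\nINVESTIGATOR: X \n",
  "no markers here"
)
narratives <- extract_narrative(docs)
```

Because the function is vectorized, it could be applied to all of the imported reports at once, and the NA entries point to documents where the markers were not found.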
After a lot of thought, I arrived at a solution that seems worth sharing, so here goes:
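# Packages used below: stringr (str_squish, %>%) and purrr (map2);
# readr is called via readr::.
library(stringr)
library(purrr)

# Collapse text_data into a single string.
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")
# Split into lines and squish whitespace for good measure.
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()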
# Create indices into the lines of our text data with grepl() regexes;
# be sure the start and end patterns pair up if scraping multiple chunks of data.
# Note the doubled backslashes: "\d" on its own is not a valid escape in an R string.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))
# The function basically says, "give me back whatever's in those indices".
pull_case_num <-
  function(index_case_num_1, index_case_num_2) {
    file_contents_lines[index_case_num_1:index_case_num_2]
  }
# map2() to iterate.
case_nums <- map2(index_case_num_1,
index_case_num_2,
pull_case_num)
# transform to dataframe
case_nums_df <- as.data.frame.character(case_nums)
# Repeat pattern for other vectors as needed.
index_case_hist_1 <-
  which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <-
  which(grepl("Case#: ", file_contents_lines))
pull_case_hist <-
  function(index_case_hist_1, index_case_hist_2) {
    file_contents_lines[index_case_hist_1:index_case_hist_2]
  }
case_hist <- map2(index_case_hist_1,
index_case_hist_2,
pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the data frames; this is also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
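The line-index idea can also be wrapped in a small function that handles one report at a time. The sketch below is a base R variant under a few assumptions: the helper name narrative_lines and its arguments are hypothetical, and it ends the chunk at the first line starting with "INVESTIGATOR:" rather than at the "Case#:" line used above, so that a narrative continuing onto another page stays in one piece.

```r
# Hypothetical helper: return the lines strictly between the first
# "CASE HISTORY" line and the next line that starts with "INVESTIGATOR:".
narrative_lines <- function(text,
                            start = "CASE HISTORY",
                            end = "^INVESTIGATOR:") {
  # Split the report into lines and trim surrounding whitespace.
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  i <- grep(start, lines, fixed = TRUE)  # start marker, literal match
  j <- grep(end, lines)                  # end marker, regex match
  if (length(i) == 0 || length(j) == 0) return(character(0))
  i <- i[1]
  j <- j[j > i][1]                       # first end marker after the start
  # Nothing to return if there is no later end marker or no line in between.
  if (is.na(j) || j - i < 2) return(character(0))
  lines[(i + 1):(j - 1)]
}

# Made-up report, for illustration only.
report <- "A\nCASE HISTORY\n one \n Case#: 1\ntwo\nINVESTIGATOR: X"
chunk <- narrative_lines(report)
```

Returning an empty character vector when a marker is missing keeps the function safe to use with lapply() over many reports; the repeated page-header lines still appear in the result, as they do in the expected output from the question.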
Thanks to everyone who replied. I hope this solution helps others in the future. :)