R - Extracting larger chunks of character data from strings?



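I am working on scraping text data from roughly 1,000 PDF files. I have managed to import them all into RStudio and have used str_subset and str_extract_all to pull out the smaller attributes I need. The main goal of the project is to scrape the case history narratives. These are paragraphs of natural language, bounded by unique words that are standardized across all of the individual documents. A reproducible example is below.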

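Can I use these two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I want to extract? If not, what approach could I take to extract the narrative data I need from each report?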

text_data <- list("ES                     SPRINGFEILD POLICE DE     FARRELL #789\n NOTIFIED                  DATE           TIME               OFFICER\nMARITAL STATUS:       UNKNOWN\nIDENTIFIED BY:    H. POIROT                     AT:   SCENE              DATE:    01/02/1895\nFINGERPRINTS TAKEN BY                         DATE\n YES                      NO                  OBIWAN KENOBI                            01/02/1895\n
SPRINGFEILD\n CASE#:       012-345-678\n ABC NOTIFIED:                                    ABC DATE:\n ABC OFFICER:                                           NATURE:\nCASE HISTORY\n    This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR:       HERCULE POIROT             \n")

Here is the expected output.

output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.")

Thanks for your help!

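A quick approach is to use gsub with regular expressions to remove everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything from INVESTIGATOR: onward ('INVESTIGATOR:.*'). What remains is the text between the two matches.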

gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n    This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
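
If you need to run this over many documents, a capture-group variant may be easier to reuse. The following is only a sketch in base R: the helper name extract_between and the toy docs list are made up for illustration and are not part of the question. Unlike the nested gsub call, it returns NA when a marker is missing instead of handing back the whole document, and it trims the surrounding whitespace.

```r
# Hypothetical helper: return the text between two marker strings,
# or NA when either marker is missing from a document.
extract_between <- function(x, start = "CASE HISTORY", end = "INVESTIGATOR:") {
  x <- as.character(x)
  # (?s) lets "." match newlines under perl = TRUE;
  # \\Q...\\E makes the markers match literally.
  pattern <- paste0("(?s)\\Q", start, "\\E(.*?)\\Q", end, "\\E")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(
    hits,
    function(hit) if (length(hit) == 2) trimws(hit[2]) else NA_character_,
    character(1)
  )
}

# Toy input (invented): one document with both markers, one without.
docs <- list(
  "HEADER\nCASE HISTORY\n   First narrative,\ncontinued.\nINVESTIGATOR: A",
  "no markers here"
)
res <- extract_between(docs)
print(res)
```

The lazy `.*?` stops at the first INVESTIGATOR: that follows the first CASE HISTORY, so this assumes each document holds a single narrative. If a report can contain several, the pattern would need to be applied with a global matcher instead.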

After a lot of thought, I arrived at a solution worth sharing, so here it is:

library(stringr)  # str_squish()
library(purrr)    # map2(); also re-exports %>%

# Unlist text_data into a single string.
file_contents_unlist <- paste(unlist(text_data), collapse = " ")

# Split into lines and squish the whitespace for good measure.
file_contents_lines <- file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()

# Build line indices with grepl(). When scraping several chunks of data,
# make sure the start and end indices match up.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))

# Return whatever lines fall between the two indices (inclusive).
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}

# map2() iterates over the paired indices.
case_nums <- map2(index_case_num_1, index_case_num_2, pull_case_num)

# Transform to a data frame.
case_nums_df <- as.data.frame.character(case_nums)

# Repeat the pattern for other fields as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))

pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}

case_hist <- map2(index_case_hist_1, index_case_hist_2, pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)

# cbind() the results; this is also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
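
For anyone adapting the index idea, here is a stripped-down sketch of the same technique in base R. It bounds each narrative by the CASE HISTORY and INVESTIGATOR: lines from the question rather than by Case#:. The lines vector is invented toy data, and the check on the index vectors is my assumption about what "matching" indices should mean (one end marker after every start marker).

```r
# Toy input (invented): two reports already split into squished lines.
lines <- c(
  "CASE#: 012-345-678", "CASE HISTORY", "First narrative line.",
  "Second narrative line.", "INVESTIGATOR: HERCULE POIROT",
  "CASE#: 987-654-321", "CASE HISTORY", "Another narrative.",
  "INVESTIGATOR: JANE MARPLE"
)

start_idx <- grep("^CASE HISTORY", lines)
end_idx   <- grep("^INVESTIGATOR:", lines)

# Pairing only makes sense if every start has exactly one end after it.
stopifnot(length(start_idx) == length(end_idx), all(start_idx < end_idx))

# Keep the lines strictly between each pair of markers and join them.
narratives <- mapply(
  function(s, e) paste(lines[seq_len(e - s - 1) + s], collapse = " "),
  start_idx, end_idx
)
print(narratives)
```

Using seq_len() instead of (s + 1):(e - 1) avoids R's backwards-counting range when the two markers sit on adjacent lines; in that case the narrative comes back as an empty string.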

Thanks, everyone, for the replies. I hope this solution helps others in the future. :)
