How to extract text using delimiters when some delimiters are missing



I am trying to extract text from a semi-structured text document based on its headings.

Input

Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"

The desired output here is

Order     Subject Name           Grade  Report           Conclusion
1223442   History Bilbo Johnson   Bad   Need to complete  Dud

I can achieve this with the following (messy but functional) function:

library(magrittr)  # provides the %>% pipe used below

dataframeIn <- data.frame(Column, stringsAsFactors = FALSE)
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

Extractor <- function(dataframeIn, Column, delim) {
  # Keep the original data and column name so they can be added back later
  dataframeInForLater <- dataframeIn
  ColumnForLater <- Column
  Column <- rlang::sym(Column)
  dataframeIn <- data.frame(dataframeIn)
  # Split the column at every delimiter; text before the first one goes to "added_name"
  dataframeIn <- dataframeIn %>%
    tidyr::separate(!!Column, into = c("added_name", delim),
                    sep = paste(delim, collapse = "|"),
                    extra = "drop", fill = "right")
  names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
  dataframeIn <- data.frame(dataframeIn)
  # Add the original column back in so we keep the original reference
  dataframeIn <- cbind(dataframeInForLater[, ColumnForLater], dataframeIn)
  dataframeIn <- data.frame(dataframeIn)
  return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)

However, sometimes a delimiter is missing, for example

Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud

In this case, the desired output is

Order     Subject Name           Grade  Conclusion
1223442   History Bilbo Johnson   Bad    Dud

But the actual output becomes:

Order   Subject            Name   Grade Report Conclusion
:1223442  :History   Bilbo Johnson  : Bad    : Dud       <NA>

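How can I account for missing delimiters, given that the ones present always appear in the same order (including delimiters missing from the middle or the end of the text, as in the example above)?

We can do the following (this only does the text extraction; building the output table is left to you):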

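library(stringr)
# Column1 is the full input from the question; Column2 is the one without "Report"
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor <- function(x, delim) {
  # One pattern per delimiter: optional colon, lazy capture, then the next delimiter or end of string
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  # Column 2 of the match matrix holds the first capture group
  trimws(str_match(x, pattern)[, 2])
}
Extractor(Column1, delim)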
# [1] "1223442"          "History"          "Bilbo Johnson"    "Bad"              "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442"       "History"       "Bilbo Johnson" "Bad"           NA              "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA              "History"       "Bilbo Johnson" NA              NA              NA

Since missing fields come back as NA, it is clear which delimiters are missing and which are not.
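To get from these vectors to the table asked for in the question, one option is to apply the extractor to each input string and bind the results row-wise. The following is a minimal sketch rather than part of the original answer: the helper name ExtractToTable and its drop_empty argument are made up for illustration, and it assumes the delimiter words never occur inside the field values.

```r
library(stringr)

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

# Same extractor as above: one pattern per delimiter, lazy capture up to the
# next delimiter or the end of the string.
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}

# Hypothetical helper: one row per input string, one column per delimiter.
ExtractToTable <- function(strings, delim, drop_empty = TRUE) {
  rows <- lapply(strings, Extractor, delim = delim)
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- delim
  if (drop_empty) {
    # Drop columns whose delimiter was missing from every input string.
    out <- out[, colSums(!is.na(out)) > 0, drop = FALSE]
  }
  out
}

Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
result <- ExtractToTable(Column2, delim)
print(result)
```

With drop_empty = FALSE the Report column is kept and filled with NA, which may be preferable when rows with different missing fields have to line up in one table.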

The way it works in your case is that we have a series of patterns, one per delimiter:

pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"   
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"      
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"    
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"

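Then str_match nicely extracts the (.*?) part into the second column of its output, and trimws removes any surrounding whitespace. Also, we use lazy matching in (.*?) to avoid matching too much.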
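To see why the lazy quantifier matters, the sketch below compares the Grade pattern with a greedy variant on the first input from the question. The variable names (x, ends, lazy, greedy) are illustrative only.

```r
library(stringr)

x <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
ends <- "(Order|Subject|Name|Grade|Report|Conclusion|$)"

# Lazy: (.*?) stops at the first delimiter that follows "Grade".
lazy <- trimws(str_match(x, paste0("Grade:{0,1}(.*?)", ends))[, 2])

# Greedy: (.*) runs to the end of the string, leaving only "$" to match.
greedy <- trimws(str_match(x, paste0("Grade:{0,1}(.*)", ends))[, 2])

print(lazy)
# [1] "Bad"
print(greedy)
# [1] "Bad Report: Need to complete Conclusion: Dud"
```

The greedy version swallows every later field, which is exactly the over-matching the lazy quantifier prevents.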