I am trying to extract text from semi-structured text documents based on the headings they contain.
Input:
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The desired output is:
Order    Subject  Name           Grade  Report            Conclusion
1223442  History  Bilbo Johnson  Bad    Need to complete  Dud
I can achieve this with the following (messy but working) function:
library(magrittr)  # provides the %>% pipe used below
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")
Extractor <- function(dataframeIn, Column, delim) {
dataframeInForLater<-dataframeIn
ColumnForLater<-Column
Column <- rlang::sym(Column)
dataframeIn <- data.frame(dataframeIn)
dataframeIn<-dataframeIn %>%
tidyr::separate(!!Column, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
dataframeIn<-data.frame(dataframeIn)
# Add the original column back in to keep a reference to the source text
dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
dataframeIn<-data.frame(dataframeIn)
return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)
However, sometimes a delimiter is missing, for example:
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
In this case the desired output is:
Order    Subject  Name           Grade  Conclusion
1223442  History  Bilbo Johnson  Bad    Dud
But the actual output becomes:
Order     Subject   Name           Grade  Report  Conclusion
:1223442  :History  Bilbo Johnson  : Bad  : Dud   <NA>
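How can I account for missing delimiters, given that the ones present always appear in the same order (including delimiters missing from the middle or the end of the text, as in the example above)?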
We can do the following (this covers only the text extraction; building the output is left to you):
library(stringr)
Extractor <- function(x, delim) {
pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
trimws(str_match(x, pattern)[, 2])
}
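# Column1 is the full input from the question; Column2 is the variant without "Report"
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor(Column1, delim)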
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA
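Since the result contains NA for every delimiter that was not found, it is clear which delimiters are missing and which are present.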
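Since building the output is left open, here is one possible sketch of that step. It assumes the desired result is a one-row data frame that keeps only the delimiters actually found. The helper name `to_row` is hypothetical and not part of the original answer; `Extractor` and `delim` are repeated so the block runs on its own.

```r
library(stringr)

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

# Same extraction as in the answer: one pattern per delimiter, with a lazy
# capture that stops at the next delimiter or at the end of the string.
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}

# Hypothetical helper (an assumption, not from the original answer):
# name the extracted values by delimiter, drop the NA entries, and
# convert what remains into a one-row data frame.
to_row <- function(x, delim) {
  values <- setNames(Extractor(x, delim), delim)
  values <- values[!is.na(values)]
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
to_row(Column2, delim)
```

For an input with no Report field, the resulting data frame simply has no Report column. If several rows need to be combined, it is probably easier to keep the NA values so that every row has the same columns.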
The way it works in your case is that we build one pattern per delimiter:
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
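Then str_match conveniently extracts the (.*?) part into the second column of its output, and trimws strips any surrounding whitespace. Note that we use lazy matching in (.*?) so that the capture stops at the first following delimiter instead of matching too much.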
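To illustrate why the lazy quantifier matters, the following small sketch compares (.*?) with the greedy (.*) on a shortened input. The variable names `x`, `stop_at`, `lazy`, and `greedy` are made up for this example.

```r
library(stringr)

x <- "Order:1223442 Subject:History Name Bilbo Johnson"
stop_at <- "(Subject|Name|$)"

# Lazy: capture as little as possible, so the match ends at the first
# stopping point, which here is "Subject".
lazy <- str_match(x, paste0("Order:?(.*?)", stop_at))[, 2]

# Greedy: capture as much as possible, so the match runs to the end of
# the string, where "$" satisfies the stopping group.
greedy <- str_match(x, paste0("Order:?(.*)", stop_at))[, 2]

trimws(lazy)    # "1223442"
trimws(greedy)  # "1223442 Subject:History Name Bilbo Johnson"
```

With the greedy version, every field would swallow all the text after it, which is why the answer relies on the lazy form.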