Trying to use regular expressions in R to separate questions from answers in an online form



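I am reading forms into a data frame in R. One column holds the content of the form, including both the questions and the answers. I am trying to separate the two without going through every combination of questions across the various forms. The data is structured as follows: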

text <- c('Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?: I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement.\r\n')
number <- c(1)
df <- data.frame(number, text)

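So each question starts on a new line and ends with a colon.

What I want to end up with is a list of questions and a corresponding list of answers.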

First, split the text into separate lines. Then a simple regular expression lets you pull out the questions and the answers.

Lines = unlist(strsplit(text, "\r\n"))
Questions = sub("(.*?):.*", "\\1", Lines)
Answers   = sub(".*?:(.*)", "\\1", Lines)
Questions
[1] "Select the benefit your question is related to"
[2] " What is your question?"                       
Answers
[1] " Life or AD&D Insurance"                                                                                                                                     
[2] " I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement."
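The same split-then-sub idea can be applied to every row of the data frame. The following is only a sketch: split_form is a hypothetical helper name, the sample text is shortened, and trimming whitespace and dropping blank lines are assumptions about what the final table should look like.

```r
# Hypothetical helper (not part of the original answer): turn one form's text
# into a data frame with one row per question/answer pair.
split_form <- function(id, text) {
  lines <- unlist(strsplit(text, "\r\n", fixed = TRUE))
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  data.frame(
    id       = rep(id, length(lines)),
    # everything before the first colon is the question
    question = trimws(sub("^(.*?):.*$", "\\1", lines, perl = TRUE)),
    # everything after the first colon is the answer
    answer   = trimws(sub("^.*?:(.*)$", "\\1", lines, perl = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Shortened sample data in the same shape as the question's df
df <- data.frame(
  number = 1,
  text = "Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?: I am interested in purchasing additional life insurance.\r\n",
  stringsAsFactors = FALSE
)

# One data frame per form, stacked into a single long table
result <- do.call(rbind, Map(split_form, df$number, df$text))
rownames(result) <- NULL
result
```

A line that contains no colon is returned unchanged in both columns, so real data may need an extra check for that case.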

Using stringr::str_match_all:

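# Capture each "question: answer" pair; column 1 of the match matrix is the full match, so drop it
tmp <- lapply(stringr::str_match_all(df$text, '\\s*(.*?):\\s*(.*?)\r\n'),
              function(x) x[, -1, drop = FALSE])
# Repeat each form's id once per pair found in that form
result <- cbind(id = rep(df$number, sapply(tmp, nrow)),
                do.call(rbind.data.frame, tmp))
names(result) <- c('id', 'question', 'answer')
result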

The quickest way is to use gsub. Working code would be:

df<-data.frame(question=gsub("[?][:].*","?",text), answer=gsub(".*[?:]","",text),id=1)

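The question pattern I wrote looks for the first occurrence of ?:, since I see you use that combination to separate a question from its answer. So if there are multiple instances of ?:, gsub uses the first one to split the question from the answer. The code replaces that ?: and everything after it with a single ? to form the question part of the given text, so if you do not need the question mark at the end of the question, you can change that part to: question=gsub("[?][:].*","",text)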
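To make the behavior of the two patterns concrete, here is a small sketch on a made-up single-line string (x, q and a are illustrative names, not from the original answer). It shows that the question pattern cuts at the first ?:, while the answer pattern, being a greedy match on a character class, cuts at the last ? or : in the text.

```r
# Made-up example string with two "?:" separators
x <- "Type?: billing; Detail?: late fee"

# "[?][:].*" matches from the first literal "?:" to the end of the string,
# so only the text before the first "?:" survives (plus the replacement "?")
q <- sub("[?][:].*", "?", x)

# "[?:]" is a character class meaning "either ? or :"; because ".*" is greedy,
# the match extends to the LAST such character, leaving only the final answer
a <- sub(".*[?:]", "", x)

q
a
```

So with several question/answer pairs in one string, this approach keeps only the first question stem and the last answer, which matches the output shown for the sample data.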

The output is:

df$answer 
[1] I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement.\r\n

And the question:

df$question
[1] Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?
