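I am reading forms into a data frame in R. One of the columns contains the content of the form, including both the questions and the answers. I am trying to separate the two without going through every question combination across the various forms. The data is structured as follows: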
text <- c('Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?: I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement.\r\n')
number <- c(1)
df <- data.frame(number,text)
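So each question starts on a new line and ends with a colon.
I would like to end up with a list of questions and a list of the corresponding answers.
First, split the text into separate lines. Then a simple regular expression lets you pull out the questions and the answers.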
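# Split on the Windows line ending (CR LF)
Lines = unlist(strsplit(text, "\r\n"))
# Keep the part before / after the first colon on each line
Questions = sub("(.*?):.*", "\\1", Lines)
Answers = sub(".*?:(.*)", "\\1", Lines)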
Questions
[1] "Select the benefit your question is related to"
[2] " What is your question?"
Answers
[1] " Life or AD&D Insurance"
[2] " I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement."
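If the same split has to run over every row of the data frame, the two steps above can be wrapped in a small helper. This is only a minimal sketch in base R: the function name split_qa and the output column names are made up for illustration, and it assumes that lines are separated by \r\n and that the first colon on each line separates the question from the answer.

```r
# Hypothetical helper (not from the original answer): split one form's
# text into question/answer pairs, one row per line of the form.
split_qa <- function(id, text) {
  # One element per line; trim whitespace and drop empty lines.
  lines <- trimws(unlist(strsplit(text, "\r\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]
  data.frame(
    id       = rep(id, length(lines)),
    # Everything before the first colon is the question ...
    question = trimws(sub("(.*?):.*", "\\1", lines)),
    # ... and everything after it is the answer.
    answer   = trimws(sub(".*?:(.*)", "\\1", lines)),
    stringsAsFactors = FALSE
  )
}

# Same sample data as in the question.
text <- "Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?: I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement.\r\n"
df <- data.frame(number = 1, text = text, stringsAsFactors = FALSE)

# Apply the helper row by row and stack the pieces into one data frame.
result <- do.call(rbind, Map(split_qa, df$number, df$text))
print(result$question)
```

The trimws calls remove the leading space that is otherwise left in front of the second question and of each answer.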
Using stringr::str_match_all:
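# Each match yields the full match plus two capture groups; drop the full match
tmp <- lapply(stringr::str_match_all(df$text, '\\s*(.*?):\\s*(.*?)\r\n'),
              function(x) x[, -1, drop = FALSE])
# Repeat each id once per question found in that row
result <- cbind(id = rep(df$number, sapply(tmp, nrow)),
                do.call(rbind.data.frame, tmp))
names(result) <- c('id', 'question', 'answer')
result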
The quickest way is to use gsub; working code would be:
df <- data.frame(question = gsub("[?][:].*", "?", text), answer = gsub(".*[?:]", "", text), id = 1)
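The pattern I wrote searches for the first occurrence of ?:, because I see that you use this combination to separate the question from the answer. So if there are multiple ?: instances, gsub uses the first one as the split point. The code replaces ?: and everything after it with a single ? to form the question part of the given text. If you do not need the question mark at the end of the question, you can change that part to: question=gsub("[?][:].*","",text)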
The output is:
df$answer
[1]  I am interested in purchasing additional life insurance and was wondering if someone could assist me with locating my most recent Life Insurance Statement.\r\n
And the question:
df$question
[1] Select the benefit your question is related to: Life or AD&D Insurance\r\n What is your question?
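To see what the two patterns do in isolation, here is a small sketch on a single-question string. The sample string is shortened from the data above and is an assumption for illustration only. Note that in the answer pattern [?:] is a character class matching either ? or :, and the greedy .* in front of it runs up to the last such character in the text.

```r
# Shortened sample (assumption): one question followed by its answer.
x <- "What is your question?: I am interested in purchasing additional life insurance."

# Question: replace "?:" and everything after it with a single "?".
question <- gsub("[?][:].*", "?", x)

# Answer: delete everything up to and including the last "?" or ":".
answer <- gsub(".*[?:]", "", x)

# Variant without the trailing question mark.
question_bare <- gsub("[?][:].*", "", x)

print(question)
print(answer)
print(question_bare)
```

Because the answer pattern keys on the last ? or :, an answer that itself contains either character would be cut short, so this approach fits best when the text holds one question and a plain answer.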