R: search strings by keyword and check whether a date falls between a start date and an end date



I have a set of sentences:

{ cat ate rat, rat was killed, cat killed the rat, rat killed by rat}.

First, I want to check whether the value in column Col2 contains any of these sentences.

Second, if there is a match, I want to check whether the date in Col3 falls between the start and end dates in Col4 and Col5.

Below is a test dataset:

Id      Col2                Col3        Col4        Col5
1       This cat            05-09-2001  04-10-2000  09-14-2001
2       This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3       Cat was killed      02-04-2015  02-01-2015  03-12-2015
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015

This is the expected output:

Id      Col2                Col3        Col4        Col5         Event
1       This cat            05-09-2001  04-10-2000  09-14-2001   No
2       Cat ate rat         05-04-2011  05-01-2011  05-14-2011   Yes
3       Cat died            02-04-2015  02-01-2015  03-12-2015   No
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014   Yes
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015   No

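This is what I have done so far. The code below works and gives me the result I want, but it is very inefficient and slow. In particular, if my df has 3 million rows, it would take about 10 days to finish running. Any suggestions for a more efficient way to solve this would be much appreciated.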

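keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
# Initialize the column so that rows can be filled in one at a time
Df$Event <- NA_character_
for (i in 1:NROW(Df)) {
  # Does row i of Col2 match any of the keywords?
  if (NROW(Df[grep(paste0(keywords, collapse = "|"), Df$Col2[i]), ]) > 0) {
    if ((Df$Col3[i] > Df$Col4[i]) & (Df$Col3[i] < Df$Col5[i])) {
      Df$Event[i] <- "Yes"
    } else {
      Df$Event[i] <- "No"
    }
  }
  print(i)
}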

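Basically, you need to test three conditions:

  • Col3 >= Col4
  • Col3 <= Col5
  • Col2 is one of the keywords

Use vectorized functions (such as ifelse and %in%) to speed up your code.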

mydf <- structure(list(Id = 1:5, Col2 = c("This cat", "This cat ate a rat", 
"Cat was killed", "Cat killed the rat", "Rat ran away"), Col3 = structure(c(11451, 
15098, 16470, 16349, 13950), class = "Date"), Col4 = structure(c(11057, 
15095, 16467, 16333, 16537), class = "Date"), Col5 = structure(c(11579, 
15108, 16506, 16354, 16545), class = "Date")), .Names = c("Id", 
"Col2", "Col3", "Col4", "Col5"), row.names = c(NA, -5L), class = "data.frame")
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5) 
& mydf$Col2 %in% keywords, "Yes", "No")

Note that this version is case-sensitive. You may be interested in functions such as tolower:

mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5) 
& tolower(mydf$Col2) %in% keywords, "Yes", "No")
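
The question asks whether Col2 contains a keyword, whereas %in% tests for exact equality. If substring matching is what is wanted, one possible sketch uses grepl, which is also vectorized. The helper name flag_events and the demo data frame are hypothetical, introduced only for illustration, and the keywords are assumed to contain no regular-expression metacharacters.

```r
# Hypothetical helper (not from the answer above): "Yes" when the text contains
# any keyword AND the date lies within [start, end], otherwise "No".
flag_events <- function(text, date, start, end, keywords) {
  # One pattern that matches any keyword; assumes keywords are plain text
  pattern <- paste(keywords, collapse = "|")
  has_keyword <- grepl(pattern, text, ignore.case = TRUE)
  in_window <- date >= start & date <= end
  ifelse(has_keyword & in_window, "Yes", "No")
}

keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# Small illustrative data set (assumed, not the asker's real data)
demo <- data.frame(
  Col2 = c("This cat", "Cat killed the rat", "the cat killed the rat today"),
  Col3 = as.Date(c("2001-05-09", "2014-10-06", "2008-03-12")),
  Col4 = as.Date(c("2000-04-10", "2014-09-20", "2015-04-12")),
  Col5 = as.Date(c("2001-09-14", "2014-10-11", "2015-04-20")),
  stringsAsFactors = FALSE
)

demo$Event <- flag_events(demo$Col2, demo$Col3, demo$Col4, demo$Col5, keywords)
print(demo)
```

Note that even with substring matching, a value such as "This cat ate a rat" does not contain the literal keyword "cat ate rat", so it would still be flagged "No" unless the keywords are loosened.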

Short answer:

df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)

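does what you were trying to do in the for loop.

In R, you should avoid for loops and try to use the apply-family functions instead. tolower converts the contents of df$Col2 to lower case. The function function(el) el %in% sentences is then applied to each element of this column vector: it asks whether the element is a member of the sentences character vector. The Boolean results are first collected into a list, and sapply then tries to simplify the collected results into a vector.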
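
As a side note, %in% is itself vectorized over its left-hand side, so the same Boolean vector can be obtained without sapply. The following sketch compares the two forms on the Col2 values from the question; the variable names col2, via_sapply and direct are made up for this illustration.

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
col2 <- c("This cat", "This cat ate a rat", "Cat was killed",
          "Cat killed the rat", "Rat ran away")

# Element-by-element version, using sapply as described above
via_sapply <- sapply(tolower(col2), function(el) el %in% sentences)

# Direct version: %in% already handles a whole vector at once
direct <- tolower(col2) %in% sentences

# sapply() names its result after the input strings, so drop the names to compare
print(identical(unname(via_sapply), direct))
```

On large inputs the direct form avoids one R function call per row, which is usually where the time goes.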

Full working code version:

Reading and preparing the data

sentences <- unlist(strsplit("cat ate rat, rat was killed, cat killed the rat, rat killed by rat",", "))

This helper only serves to turn the given text into a data frame:

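txt2df <- function(dfstr) {
  # Split the text into lines, then split each line on runs of two or more spaces
  lines <- unlist(strsplit(dfstr, "\n"))
  l <- strsplit(lines, " {2,}")
  # The first line holds the column names; the remaining lines are the rows
  df <- as.data.frame(do.call(rbind, l[-1]), stringsAsFactors = FALSE)
  colnames(df) <- l[[1]]
  df
}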

Apply the function to the multi-line string to get the data.frame:

df <- txt2df("Id      Col2                Col3        Col4        Col5
1       This cat            05-09-2001  04-10-2000  09-14-2001
2       This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3       Cat was killed      02-04-2015  02-01-2015  03-12-2015
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015")

df
Id               Col2       Col3       Col4       Col5
1  1           This cat 05-09-2001 04-10-2000 09-14-2001
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015

Lookup

Check whether the lower-cased value of df$Col2 is one of the sentences:

df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)

Result

df
Id               Col2       Col3       Col4       Col5 Event
1  1           This cat 05-09-2001 04-10-2000 09-14-2001 FALSE
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011 FALSE
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015 FALSE
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014  TRUE
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015 FALSE
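
The result above covers only the keyword lookup. A possible way to add the date check from the question is sketched below. It assumes the dates are stored as month-day-year strings (values such as "09-14-2001" rule out day-month-year) and rebuilds the data frame directly so that the block runs on its own; the helper to_date is hypothetical.

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

df <- data.frame(
  Id   = 1:5,
  Col2 = c("This cat", "This cat ate a rat", "Cat was killed",
           "Cat killed the rat", "Rat ran away"),
  Col3 = c("05-09-2001", "05-04-2011", "02-04-2015", "10-06-2014", "03-12-2008"),
  Col4 = c("04-10-2000", "05-01-2011", "02-01-2015", "09-20-2014", "04-12-2015"),
  Col5 = c("09-14-2001", "05-14-2011", "03-12-2015", "10-11-2014", "04-20-2015"),
  stringsAsFactors = FALSE
)

# Assumption: month-day-year strings; convert them so comparisons are by date, not by text
to_date <- function(x) as.Date(x, format = "%m-%d-%Y")

in_window   <- to_date(df$Col3) >= to_date(df$Col4) & to_date(df$Col3) <= to_date(df$Col5)
is_sentence <- tolower(df$Col2) %in% sentences

# Both conditions must hold for an event
df$Event <- ifelse(is_sentence & in_window, "Yes", "No")
print(df)
```

With exact matching, only row 4 satisfies both conditions in this test data.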
