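I have a set of sentences:
{ cat ate rat, rat was killed, cat killed the rat, rat killed by rat }
First, I want to check whether the value in column Col2 contains any of these sentences.
Second, if there is a match, I want to check whether the date in Col3 falls between the start and end dates in Col4 and Col5.
Below is a test dataset: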
Id  Col2                Col3        Col4        Col5
1   This cat            05-09-2001  04-10-2000  09-14-2001
2   This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3   Cat was killed      02-04-2015  02-01-2015  03-12-2015
4   Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5   Rat ran away        03-12-2008  04-12-2015  04-20-2015
This is the expected output:
Id  Col2                Col3        Col4        Col5        Event
1   This cat            05-09-2001  04-10-2000  09-14-2001  No
2   Cat ate rat         05-04-2011  05-01-2011  05-14-2011  Yes
3   Cat died            02-04-2015  02-01-2015  03-12-2015  No
4   Cat killed the rat  10-06-2014  09-20-2014  10-11-2014  Yes
5   Rat ran away        03-12-2008  04-12-2015  04-20-2015  No
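This is what I have done so far. The code below works and gives me the result I want, but it is very inefficient and takes a long time to run. With a df of 3 million rows it would need about 10 days to finish. Any suggestions for a more efficient way to solve this would be much appreciated.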
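keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
# start with "No" everywhere so rows without a keyword match are not left empty
Df$Event <- "No"
for (i in 1:NROW(Df)) {
  # does row i of Col2 match any of the keywords?
  if (NROW(Df[grep(paste0(keywords, collapse = "|"), Df$Col2[i]), ]) > 0) {
    if ((Df$Col3[i] > Df$Col4[i]) & (Df$Col3[i] < Df$Col5[i])) {
      Df$Event[i] <- "Yes"
    } else {
      Df$Event[i] <- "No"
    }
  }
  print(i)
}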
Basically you need to test three conditions:
- Col3 >= Col4
- Col3 <= Col5
- Col2 is in keywords
Use vectorized functions (such as ifelse or %in%) to speed up your code.
mydf <- structure(list(Id = 1:5, Col2 = c("This cat", "This cat ate a rat",
"Cat was killed", "Cat killed the rat", "Rat ran away"), Col3 = structure(c(11451,
15098, 16470, 16349, 13950), class = "Date"), Col4 = structure(c(11057,
15095, 16467, 16333, 16537), class = "Date"), Col5 = structure(c(11579,
15108, 16506, 16354, 16545), class = "Date")), .Names = c("Id",
"Col2", "Col3", "Col4", "Col5"), row.names = c(NA, -5L), class = "data.frame")
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5)
& mydf$Col2 %in% keywords, "Yes", "No")
Note that this version is case sensitive. You may be interested in functions such as tolower:
mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5)
& tolower(mydf$Col2) %in% keywords, "Yes", "No")
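Note that %in% tests for exact equality, while the question asks whether Col2 contains one of the sentences. A minimal sketch of a substring variant follows, assuming the keywords contain no regex metacharacters; it uses grepl with ignore.case = TRUE on the whole column at once. The helper names pattern, has_keyword and in_range are only illustrative, and the data frame is rebuilt from the example dates so the block runs on its own.

```r
# Toy data mirroring the example above (dates written as ISO strings for readability).
mydf <- data.frame(
  Id = 1:5,
  Col2 = c("This cat", "This cat ate a rat", "Cat was killed",
           "Cat killed the rat", "Rat ran away"),
  Col3 = as.Date(c("2001-05-09", "2011-05-04", "2015-02-04", "2014-10-06", "2008-03-12")),
  Col4 = as.Date(c("2000-04-10", "2011-05-01", "2015-02-01", "2014-09-20", "2015-04-12")),
  Col5 = as.Date(c("2001-09-14", "2011-05-14", "2015-03-12", "2014-10-11", "2015-04-20")),
  stringsAsFactors = FALSE
)
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# One regex alternation, evaluated once over the whole column (no loop).
pattern <- paste(keywords, collapse = "|")
has_keyword <- grepl(pattern, mydf$Col2, ignore.case = TRUE)

# Vectorized date-range test (inclusive bounds, as in the answer above).
in_range <- mydf$Col3 >= mydf$Col4 & mydf$Col3 <= mydf$Col5

mydf$event <- ifelse(has_keyword & in_range, "Yes", "No")
print(mydf)
```

With this data only row 4 becomes "Yes": "This cat ate a rat" does not contain the literal text "cat ate rat", so matching it would need some form of fuzzy matching, which is a different technique from the one sketched here.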
Short answer:
df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)
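This does what you wanted to do in your for loop.
In R, you should generally avoid for loops and try the apply family of functions instead. tolower converts the contents of df$Col2 to lowercase. The function function(el) el %in% sentences is then applied to each element of this column vector; it asks whether the element is a member of the sentences character vector. The boolean results are first collected in a list, which sapply then tries to simplify into a vector.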
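As a side note, %in% is itself vectorized, so the same lookup can be written without sapply. The sketch below compares the two forms on the Col2 values from the question; the variable names col2, via_sapply and direct are only illustrative.

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
col2 <- c("This cat", "This cat ate a rat", "Cat was killed",
          "Cat killed the rat", "Rat ran away")

# Element-by-element lookup through sapply, as in the short answer.
via_sapply <- sapply(tolower(col2), function(el) el %in% sentences)

# The same lookup in a single vectorized call.
direct <- tolower(col2) %in% sentences

# sapply keeps the input strings as names; apart from that the results agree.
print(identical(unname(via_sapply), direct))
```

On a data frame with millions of rows the direct form avoids one function call per row, which is usually where the time goes.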
Full working code version:
Reading and preparing the data
sentences <- unlist(strsplit("cat ate rat, rat was killed, cat killed the rat, rat killed by rat",", "))
A helper just to convert the given text into a data frame:
txt2df <- function(dfstr) {
# split the text into lines, then each line into fields separated by two or more spaces
lines <- unlist(strsplit(dfstr, "\n"))
l <- unlist(lapply(lines, strsplit, " {2,}"), recursive = FALSE)
# the first line holds the column names; the remaining lines are the rows
df <- as.data.frame(Reduce(rbind, l[2:length(l)]), row.names = FALSE)
colnames(df) <- l[[1]]
df
}
Apply the function to the multi-line string to get a data.frame:
df <- txt2df("Id  Col2                Col3        Col4        Col5
1   This cat            05-09-2001  04-10-2000  09-14-2001
2   This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3   Cat was killed      02-04-2015  02-01-2015  03-12-2015
4   Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5   Rat ran away        03-12-2008  04-12-2015  04-20-2015")
df
  Id               Col2       Col3       Col4       Col5
1  1           This cat 05-09-2001 04-10-2000 09-14-2001
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015
Lookup
Check whether the lowercased value of df$Col2 is one of the sentences:
df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)
Result
df
  Id               Col2       Col3       Col4       Col5 Event
1  1           This cat 05-09-2001 04-10-2000 09-14-2001 FALSE
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011 FALSE
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015 FALSE
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014  TRUE
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015 FALSE
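The lookup above covers only the sentence condition. A minimal sketch of adding the date condition from the question follows, assuming the date columns are character strings in month-day-year order (as values like 09-14-2001 suggest). The helper to_date and the intermediate names are hypothetical, and the data frame is built directly so the block runs on its own.

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

df <- data.frame(
  Id = as.character(1:5),
  Col2 = c("This cat", "This cat ate a rat", "Cat was killed",
           "Cat killed the rat", "Rat ran away"),
  Col3 = c("05-09-2001", "05-04-2011", "02-04-2015", "10-06-2014", "03-12-2008"),
  Col4 = c("04-10-2000", "05-01-2011", "02-01-2015", "09-20-2014", "04-12-2015"),
  Col5 = c("09-14-2001", "05-14-2011", "03-12-2015", "10-11-2014", "04-20-2015"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse month-day-year strings into Date objects,
# so that comparisons are chronological rather than alphabetical.
to_date <- function(x) as.Date(x, format = "%m-%d-%Y")

col3 <- to_date(df$Col3)
col4 <- to_date(df$Col4)
col5 <- to_date(df$Col5)

is_sentence <- tolower(df$Col2) %in% sentences   # exact match, case-insensitive
in_range <- col3 >= col4 & col3 <= col5          # inclusive date range

df$Event <- ifelse(is_sentence & in_range, "Yes", "No")
print(df)
```

Comparing the raw strings instead of parsed dates would sort by month first and give wrong answers across years, which is why the conversion step comes before the comparison.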