More specifically: I have a data frame (my.df) similar to the one below:
City Month Answer
Montreal Jan n
Montreal Feb n
Montreal Mar n
Toronto Jan oui
Toronto Feb n
Toronto Mar n
Calgary Jan n
Calgary Feb n
Calgary Mar yes
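Now I need to subset it based on the column labelled Answer. More precisely, if Answer is oui (as for Toronto in January) or yes (as for Calgary in March), I need to get something like this: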
City Month Answer
Toronto Jan oui
Toronto Feb n
Toronto Mar n
Calgary Jan n
Calgary Feb n
Calgary Mar yes
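In other words, a data frame that contains no entries for Montreal (which has neither oui nor yes).
My data frame has dimensions 37045 x 41, and there are some messy entries under Answer, such as ouu, yess or oii. I tried combining regex with %in%, like this: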
oui <- grep('ou', Answer)
yes <- grep('ye', Answer)
oui.yes <- union(oui, yes)
ans <- my.df[oui.yes, 3]
new.df <- my.df[Ans %in% my.df$Answer, ]
Unfortunately, the resulting new.df is exactly the same as my.df.
Any help would be greatly appreciated.
Ignacio Vera.
One approach is to use ave from base R:
df[with(df, ave(Answer %in% c("oui", "yes"), City, FUN=any)),]
# City Month Answer
#4 Toronto Jan oui
#5 Toronto Feb n
#6 Toronto Mar n
#7 Calgary Jan n
#8 Calgary Feb n
#9 Calgary Mar yes
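To make the ave one-liner easier to follow, here is a minimal sketch that breaks it into steps. The intermediate names hit, keep and result are made up for illustration, and the data frame is rebuilt with data.frame() so the snippet runs on its own.

```r
# Same toy data as in the "Data" section, rebuilt so the snippet is self-contained
df <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary"), each = 3),
  Month  = rep(c("Jan", "Feb", "Mar"), times = 3),
  Answer = c("n", "n", "n", "oui", "n", "n", "n", "n", "yes"),
  stringsAsFactors = FALSE
)

# Step 1: row-level flag, TRUE where the answer is one of the positive values
hit <- df$Answer %in% c("oui", "yes")

# Step 2: ave() applies any() within each City and writes the single
# result back to every row of that city
keep <- ave(hit, df$City, FUN = any)

# Step 3: logical row indexing keeps all rows of the flagged cities
result <- df[keep, ]
```

The point of ave here is that it returns a vector as long as its input, so the group-level answer can be used directly as a row index.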
Or use data.table:
library(data.table)
setDT(df)[df[,.I[any(Answer %in% c("oui", "yes"))], by=City]$V1,]
# City Month Answer
#1: Toronto Jan oui
#2: Toronto Feb n
#3: Toronto Mar n
#4: Calgary Jan n
#5: Calgary Feb n
#6: Calgary Mar yes
Data
df <- structure(list(City = c("Montreal", "Montreal", "Montreal", "Toronto",
"Toronto", "Toronto", "Calgary", "Calgary", "Calgary"), Month = c("Jan",
"Feb", "Mar", "Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), Answer = c("n",
"n", "n", "oui", "n", "n", "n", "n", "yes")), .Names = c("City",
"Month", "Answer"), class = "data.frame", row.names = c(NA, -9L
))
You were really close.
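As a rough sketch of where the attempt in the question seems to go wrong: the operands of %in% are the wrong way round, and the test is made on Answer values rather than on cities. The snippet below reproduces this on the toy data; the names reversed and intended are made up for illustration.

```r
my.df <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary"), each = 3),
  Month  = rep(c("Jan", "Feb", "Mar"), times = 3),
  Answer = c("n", "n", "n", "oui", "n", "n", "n", "n", "yes"),
  stringsAsFactors = FALSE
)

oui <- grep("ou", my.df$Answer)
yes <- grep("ye", my.df$Answer)
ans <- my.df[union(oui, yes), 3]    # the matched answers themselves

# As written in the question: "is each matched answer present in the column?"
# Every element is TRUE by construction, and the short logical vector is
# recycled over the rows, so nothing is filtered out.
reversed <- ans %in% my.df$Answer

# Flipping the operands flags the matching rows instead, but this still
# selects rows, not whole cities.
intended <- my.df$Answer %in% ans
```

Even with the operands flipped, only the two matching rows would be kept, which is why the solution below goes through City.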
dat <- structure(list(City = c("Montreal", "Montreal", "Montreal", "Toronto",
"Toronto", "Toronto", "Calgary", "Calgary", "Calgary"), Month = c("Jan",
"Feb", "Mar", "Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), Answer = c("n",
"n", "n", "oui", "n", "n", "n", "n", "yes")), .Names = c("City",
"Month", "Answer"), class = "data.frame", row.names = c(NA, -9L))
dat[dat$City %in% unique(dat[dat$Answer %in% c("yes", "oui"),]$City),]
## City Month Answer
## 4 Toronto Jan oui
## 5 Toronto Feb n
## 6 Toronto Mar n
## 7 Calgary Jan n
## 8 Calgary Feb n
## 9 Calgary Mar yes
You can split it up (for readability):
positive_cities <- unique(dat[dat$Answer %in% c("yes", "oui"),]$City)
dat[dat$City %in% positive_cities,]
And there are countless other ways to achieve this.
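One such variation, for the messy entries mentioned in the question (ouu, yess, oii): a minimal sketch that swaps the exact %in% match for a grepl pattern. The pattern itself is an assumption about what should count as a positive answer, and the toy data below has been altered to contain two of those misspellings, so adjust both to the real column.

```r
# Toy data with two of the misspellings from the question substituted in
dat <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary"), each = 3),
  Month  = rep(c("Jan", "Feb", "Mar"), times = 3),
  Answer = c("n", "n", "n", "ouu", "n", "n", "n", "n", "yess"),
  stringsAsFactors = FALSE
)

# Assumed rule: anything starting with "ou", "oi" or "ye" is a positive answer
positive <- grepl("^(ou|oi|ye)", dat$Answer, ignore.case = TRUE)

# Same two-step logic as above: find the cities, then keep all their rows
positive_cities <- unique(dat$City[positive])
messy_result <- dat[dat$City %in% positive_cities, ]
```

Running table(dat$Answer) on the real data first is a cheap way to see which spellings actually occur before settling on a pattern.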