How to subset a data frame spanning time and space based on a time event in R



More specifically: I have a data frame (my.df) that looks like this:

  City       Month Answer
  Montreal   Jan      n
  Montreal   Feb      n
  Montreal   Mar      n
  Toronto    Jan      oui
  Toronto    Feb      n
  Toronto    Mar      n
  Calgary    Jan      n
  Calgary    Feb      n
  Calgary    Mar      yes

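Now I need to subset it based on the column labeled Answer. More precisely, if Answer is oui (as for Toronto in Jan) or yes (as for Calgary in Mar) in any month for a city, I need to keep all rows for that city, giving something like this: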

  City      Month Answer
  Toronto   Jan      oui 
  Toronto   Feb      n
  Toronto   Mar      n
  Calgary   Jan      n
  Calgary   Feb      n
  Calgary   Mar      yes

In other words, a data frame without the entries for Montreal (which has neither oui nor yes).

My data frame has dimensions 37045 x 41, and Answer contains some messy entries such as ouu, yess, oii. I tried combining regex with %in%, like this:

  oui <- grep('ou', Answer)    
  yes <- grep('ye', Answer)    
  oui.yes <- union(oui, yes)
  ans <- my.df[oui.yes, 3]    
  new.df <- my.df[Ans %in% my.df$Answer, ]

Unfortunately, the resulting new.df is identical to my.df.

Any help would be greatly appreciated.

Ignacio Vera.

One option is to use ave from base R:

df[with(df, ave(Answer %in% c("oui", "yes"), City, FUN=any)),]
#      City Month Answer
#4 Toronto   Jan    oui
#5 Toronto   Feb      n
#6 Toronto   Mar      n
#7 Calgary   Jan      n
#8 Calgary   Feb      n
#9 Calgary   Mar    yes
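
To make the one-liner above easier to follow, here is a minimal sketch that unpacks it into three steps. The intermediate names hit, keep and result are introduced here for illustration only, and the sample data is rebuilt with data.frame() instead of structure() for brevity.

```r
# Sample data from the answer, rebuilt with data.frame() for brevity
df <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary"), each = 3),
  Month  = rep(c("Jan", "Feb", "Mar"), times = 3),
  Answer = c("n", "n", "n", "oui", "n", "n", "n", "n", "yes"),
  stringsAsFactors = FALSE
)

# Step 1: TRUE for each row whose Answer is a positive response
hit <- df$Answer %in% c("oui", "yes")

# Step 2: within each City, replace every value with any(hit) for that city,
# so all rows of a city share the same flag
keep <- ave(hit, df$City, FUN = any)

# Step 3: use the row-level flag as a logical index
result <- df[keep, ]
```

Because ave returns a vector of the same length as its input, the per-city result lines up with the rows of df and can be used directly for indexing.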

Or use data.table:

library(data.table)
setDT(df)[df[,.I[any(Answer %in% c("oui", "yes"))], by=City]$V1,]
#      City Month Answer
#1: Toronto   Jan    oui
#2: Toronto   Feb      n
#3: Toronto   Mar      n
#4: Calgary   Jan      n
#5: Calgary   Feb      n
#6: Calgary   Mar    yes

Data

df <- structure(list(City = c("Montreal", "Montreal", "Montreal", "Toronto", 
 "Toronto", "Toronto", "Calgary", "Calgary", "Calgary"), Month = c("Jan", 
 "Feb", "Mar", "Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), Answer = c("n", 
 "n", "n", "oui", "n", "n", "n", "n", "yes")), .Names = c("City", 
"Month", "Answer"), class = "data.frame", row.names = c(NA, -9L
))

You were really close.

dat <- structure(list(City = c("Montreal", "Montreal", "Montreal", "Toronto", 
       "Toronto", "Toronto", "Calgary", "Calgary", "Calgary"), Month = c("Jan", 
       "Feb", "Mar", "Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), Answer = c("n", 
       "n", "n", "oui", "n", "n", "n", "n", "yes")), .Names = c("City", 
       "Month", "Answer"), class = "data.frame", row.names = c(NA, -9L
))

dat[dat$City %in% unique(dat[dat$Answer %in% c("yes", "oui"),]$City),]
##      City Month Answer
## 4 Toronto   Jan    oui
## 5 Toronto   Feb      n
## 6 Toronto   Mar      n
## 7 Calgary   Jan      n
## 8 Calgary   Feb      n
## 9 Calgary   Mar    yes

You can split it up (for readability):

positive_cities <- unique(dat[dat$Answer %in% c("yes", "oui"),]$City)
dat[dat$City %in% positive_cities,]
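
The question also mentions misspelled answers such as ouu, yess and oii, which an exact %in% match would miss. One possible way to handle them, sketched below, is to keep the regex idea from the question and use grepl to find the positive rows. The pattern, the object names messy, positive and messy_result, and the extra city Ottawa are all assumptions made for this illustration; the real pattern should be checked against the actual values in Answer.

```r
# Hypothetical data with misspelled answers like those described in the question
messy <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary", "Ottawa"), each = 2),
  Month  = rep(c("Jan", "Feb"), times = 4),
  Answer = c("n", "n", "ouu", "n", "n", "yess", "oii", "n"),
  stringsAsFactors = FALSE
)

# Assumed pattern: anything starting with "ou", "oi" or "ye" counts as positive
positive <- grepl("^(ou|oi|ye)", messy$Answer, ignore.case = TRUE)

# Same two-step lookup as above, driven by the regex match instead of %in%
positive_cities <- unique(messy$City[positive])
messy_result <- messy[messy$City %in% positive_cities, ]
```

Running table(messy$Answer[positive]) afterwards is a quick way to check that the pattern only catches the intended entries.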

There are countless other ways to do this, too.
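
As one example of such an alternative, here is a minimal base R sketch using tapply to compute one flag per city and then filtering on the city names. The names has_hit and result are illustrative only, and the sample data is rebuilt with data.frame() for brevity.

```r
# Sample data, rebuilt with data.frame() for brevity
dat <- data.frame(
  City   = rep(c("Montreal", "Toronto", "Calgary"), each = 3),
  Month  = rep(c("Jan", "Feb", "Mar"), times = 3),
  Answer = c("n", "n", "n", "oui", "n", "n", "n", "n", "yes"),
  stringsAsFactors = FALSE
)

# One logical value per city: does any of its rows have a positive answer?
has_hit <- tapply(dat$Answer %in% c("yes", "oui"), dat$City, any)

# Keep the rows whose city is flagged TRUE
result <- dat[dat$City %in% names(has_hit)[has_hit], ]
```

Unlike ave, tapply returns one value per group instead of one per row, so the city names are needed to map the flags back onto the rows.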
