R - Count the rows in a data frame that match a dynamic condition



I have a data frame like this:

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)
df
ID year
1  AB1 2005
2  AB1 2008
3  CD2 2005
4  AB3 2010
5  KK4 2007
6  AB3 2009
7  AB3 2009
8  AB1 2007
9  AB1 2000
10 CD2 2010

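I want to add a column xp that holds, for each row, the number of rows with the same ID and a smaller year value than the current row. I am looking for something like this: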

df$xp <- nrow( ID == "ID in current row" & year < "year in current row" )

The result should be:

ID year xp
1  AB1 2005  1
2  AB1 2008  3
3  CD2 2005  0
4  AB3 2010  2
5  KK4 2007  0
6  AB3 2009  0
7  AB3 2009  0
8  AB1 2007  2
9  AB1 2000  0
10 CD2 2010  1

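I'm sure there are more concise base R or data.table approaches, but here is one using dplyr and tidyr. It relies on a "non-equi join", which dplyr did not support at the time of writing (data.table and sqldf do), so I do a Cartesian join within each ID and then filter, which is less efficient on large data.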

library(dplyr)
library(tidyr)
left_join(                          # join...
  df,                               # each row of df...
  df %>%                            # with each matching row of a table where...
    left_join(df, by = "ID") %>%    # each row of df is joined to all the rows with same ID
    filter(year.y < year.x) %>%     # and we only keep preceding years
    count(ID, year = year.x),       # and we count how many there are per ID
  by = c("ID", "year")) %>%
  replace_na(list(n = 0))           # and we replace the NA's with zeroes
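
Newer dplyr releases can express the non-equi join directly. The following is a minimal sketch, assuming dplyr 1.1.0 or later (where `join_by()` accepts inequality conditions); the names `earlier_counts` and `result` are only illustrative and not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# For every distinct (ID, year) pair, match the rows of df that have the
# same ID and a strictly smaller year, then count the matches.
# Assumption: dplyr >= 1.1.0, which provides join_by() with inequalities.
earlier_counts <- df %>%
  distinct(ID, year) %>%
  inner_join(df, by = join_by(ID, year > year)) %>%  # x$year > y$year
  count(ID, year = year.x, name = "xp")

# Attach the counts to the original rows; pairs with no earlier row get 0.
result <- df %>%
  left_join(earlier_counts, by = c("ID", "year")) %>%
  mutate(xp = coalesce(xp, 0L))

result
```

Taking `distinct(ID, year)` first keeps duplicated rows (such as the two AB3 2009 rows) from inflating the counts.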

Here is an approach using dplyr and purrr:

library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))

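purrr::map_int runs the anonymous function on every element of the year column and returns an integer vector. dplyr::cur_data() returns the data for the current group as a data frame (dplyr 1.1.0 deprecated it in favour of pick()).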

Here is a data.table solution:

library(data.table)
setDT(df)
df[, xp := sapply(1:.N, function(x) sum(year < year[x])), by = ID][]
#>      ID year xp
#>  1: AB1 2005  1
#>  2: AB1 2008  3
#>  3: CD2 2005  0
#>  4: AB3 2010  2
#>  5: KK4 2007  0
#>  6: AB3 2009  0
#>  7: AB3 2009  0
#>  8: AB1 2007  2
#>  9: AB1 2000  0
#> 10: CD2 2010  1
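
Since data.table supports non-equi joins, the same counts can also be obtained without a per-row `sapply`. This is a sketch of that idea rather than the answer's own method; `dt` and `counts` are illustrative names, and it relies on `.N` being 0 for unmatched rows when `by = .EACHI` is used.

```r
library(data.table)

dt <- data.table(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# Non-equi self-join: for each row of the inner dt (the "i" table), find
# rows of the outer dt with the same ID and a strictly smaller year.
# by = .EACHI evaluates .N once per row of i, in the order of i.
counts <- dt[dt, on = .(ID, year < year), .N, by = .EACHI]

# One count per original row, so it can be assigned by position.
dt[, xp := counts$N]
dt[]
```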

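The pseudocode in the question translates almost directly to SQL. We left self-join df on the stated condition, group by row, and count the non-null joined elements.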

library(sqldf)
sqldf("select a.*, count(b.ID) xp
from df a 
left join df b on a.ID = b.ID and b.year < a.year
group by a.rowid")
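
The pseudocode can also be written almost literally in base R, with no extra packages. The sketch below shows two variants: a row-by-row `mapply` that mirrors the condition in the question, and a rank-based one that should scale better on large groups. The column name `xp_rank` is only there for comparison and is not part of the question.

```r
df <- data.frame(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# Variant 1: for each row, count rows with the same ID and a smaller year.
# This compares every row against the whole data frame, so it is O(n^2).
df$xp <- mapply(
  function(id, yr) sum(df$ID == id & df$year < yr),
  df$ID, df$year,
  USE.NAMES = FALSE
)

# Variant 2: within each ID, the "min" rank of a year minus 1 equals the
# number of strictly smaller years in that group (ties share the lowest rank).
df$xp_rank <- ave(
  df$year, df$ID,
  FUN = function(y) rank(y, ties.method = "min") - 1
)

df
```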
