I have a data frame like this:
ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)
df
ID year
1 AB1 2005
2 AB1 2008
3 CD2 2005
4 AB3 2010
5 KK4 2007
6 AB3 2009
7 AB3 2009
8 AB1 2007
9 AB1 2000
10 CD2 2010
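I want to add a column xp that holds, for each row, the number of rows with the same ID and a smaller year than the current row. I'm looking for something like this: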
df$xp <- nrow( ID == "ID in current row" & year < "year in current row" )
The result should be:
ID year xp
1 AB1 2005 1
2 AB1 2008 3
3 CD2 2005 0
4 AB3 2010 2
5 KK4 2007 0
6 AB3 2009 0
7 AB3 2009 0
8 AB1 2007 2
9 AB1 2000 0
10 CD2 2010 1
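I'm sure there are more concise base R or data.table approaches, but here is one using dplyr and tidyr. It relies on a "non-equi join", which dplyr did not support at the time of writing (data.table and sqldf do), so I did a Cartesian join on ID and then filtered, which is less efficient on large data.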
library(dplyr); library(tidyr)
left_join(                                   # join...
  df,                                        # each row of df...
  df %>%                                     # with a table of counts, built by...
    distinct() %>%                           # dropping duplicate rows so ties are not double-counted,
    left_join(df, by = "ID") %>%             # joining each row to all the rows with the same ID,
    filter(year.y < year.x) %>%              # keeping only the preceding years,
    count(ID, year = year.x, name = "xp"),   # and counting how many there are per ID and year
  by = c("ID", "year")) %>%
  replace_na(list(xp = 0L))                  # rows with no earlier year get 0 instead of NA
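Newer dplyr releases (1.1.0 and later) added join_by(), which accepts inequality conditions, so the Cartesian join plus filter can be collapsed into a single non-equi join. The following is only a sketch under that version assumption; the helper column row and the suffix .prev are names picked here for illustration, not part of the original answer.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() is available

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

res <- df %>%
  mutate(row = row_number()) %>%                 # hypothetical helper: keeps duplicate rows distinct
  left_join(df,
            by = join_by(ID, year > year),       # same ID, and the right-hand year is strictly smaller
            suffix = c("", ".prev")) %>%         # right-hand year becomes year.prev
  group_by(row, ID, year) %>%
  summarise(xp = sum(!is.na(year.prev)),         # unmatched rows carry NA, so they count as 0
            .groups = "drop") %>%
  select(-row)

res
```

Grouping by the row index rather than by ID and year keeps the two identical AB3/2009 rows separate, so each of them gets its own count.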
Here is an approach using dplyr and purrr:
library(dplyr)
library(purrr)
df %>%
group_by(ID) %>%
mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))
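purrr::map_int runs the anonymous function on every element of the year column. dplyr::cur_data() returns the data of the current group as a data frame. (In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick().)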
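As a side note, the number of rows in a group with a strictly smaller year is the same as the "minimum" rank of that year minus one, so the per-element loop can be swapped for a ranking function. This is a different technique from the map_int answer above, shown as a sketch that assumes year contains no missing values:

```r
library(dplyr)

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

res <- df %>%
  group_by(ID) %>%
  mutate(xp = min_rank(year) - 1L) %>%  # tied years share the lowest rank, so ties are not counted
  ungroup()

res
```

Because ranking sorts each group once, this avoids comparing every row against every other row in its group.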
Here is a data.table solution:
library(data.table)
setDT(df)
df[, xp := sapply(seq_len(.N), function(x) sum(year < year[x])), by = ID][]
#> ID year xp
#> 1: AB1 2005 1
#> 2: AB1 2008 3
#> 3: CD2 2005 0
#> 4: AB3 2010 2
#> 5: KK4 2007 0
#> 6: AB3 2009 0
#> 7: AB3 2009 0
#> 8: AB1 2007 2
#> 9: AB1 2000 0
#> 10: CD2 2010 1
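The pseudocode in the question translates almost directly into SQL. We left self-join df on the stated conditions, group by row, and count the non-null joined elements.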
library(sqldf)
sqldf("select a.*, count(b.ID) xp
from df a
left join df b on a.ID = b.ID and b.year < a.year
group by a.rowid")
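The same pseudocode can also be mirrored in base R without any packages. This is only a minimal sketch: it compares every row against the whole data frame, so it is quadratic in the number of rows and best suited to small data.

```r
ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

# For each row i, count the rows with the same ID and a strictly smaller year.
df$xp <- vapply(
  seq_len(nrow(df)),
  function(i) sum(df$ID == df$ID[i] & df$year < df$year[i]),
  integer(1)  # sum() of a logical vector returns a single integer
)

df
```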