Given a larger data frame with roughly 300k+ rows and 14 columns, in the following format:
df <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0,0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0),
size = c(rep(60,8),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,rep(80,9),66) )
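The goal is to create, for each team, a new vector holding the difference in size after all members have left (members goes from 2 or 1 to 0): the size in the last year with members is subtracted from the next different size. The direction of the change should be visible, so no absolute values are wanted. What I have managed so far is: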
df2 <- df %>% arrange(team_id,year) %>%
group_by(team_id) %>%
mutate(sizediff = if_else(members == 1 & lead(members) == 0 | members == 2 & lead(members) == 0,1,0, missing = 0) )
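However, instead of the value 1 in the sizediff vector, I want the difference to the future size. Perhaps reshaping from long to wide format, or conditionally rearranging the year vector, would help, but I am stuck. What I want to achieve is: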
aim <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0, 0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,0,0 ) ,
size = c(57,rep(60,7),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,88,rep(80,8),66),
sizediff = c(rep(0,6),-10,rep(0,3),rep(0,5),10,rep(0,4),rep(0,10),rep(0,3),11,rep(0,6),rep(0,7),-14,rep(0,2)) )
Is this what you are looking for?
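library(dplyr)
df %>%
  arrange(team_id, year) %>%
  group_by(team_id) %>%  # group first so lead() never looks into the next team
  # flag the last year with members before a year with none
  mutate(diff = if_else(members > 0 & lead(members, n = 1) == 0, size, 0)) %>%
  # note: last(size) is the team's final size, not necessarily the next different size
  mutate(diff = ifelse(diff > 0, last(size) - size, NA))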
Try this custom approach:
library(dplyr)
df %>%
group_by(team_id) %>%
mutate(sizediff = {
sizediff = rep(0, n())
inds <- which(members %in% c(1, 2) & lead(members) == 0)[1]
sizediff[inds] <- size[which(row_number() > inds & size != size[inds])[1]] - size[inds]
sizediff
}) -> result
result
# team_id year members size sizediff
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1954 0 60 0
# 2 1 1955 0 60 0
# 3 1 1956 0 60 0
# 4 1 1957 1 60 0
# 5 1 1958 1 60 0
# 6 1 1959 1 60 0
# 7 1 1960 2 60 -10
# 8 1 1961 0 60 0
# 9 1 1962 0 50 0
#10 1 1963 0 50 0
# … with 40 more rows
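We first initialize sizediff to 0 for every row. inds finds the first position in each team where the members leave (members is 1 or 2 and the next year is 0). We then take the next size value that differs from the size at that position, subtract the size at that position from it, and store the result at position inds.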
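To make those steps concrete, here is a minimal base R sketch that walks through them for team 1 only, using plain vectors copied from the question's df. The names next_members, ind and later are introduced here purely for illustration and are not part of the answer's code; next_members stands in for dplyr::lead().

```r
# Team 1 from the question's df, as plain vectors (years 1954 to 1963)
members <- c(0, 0, 0, 1, 1, 1, 2, 0, 0, 0)
size    <- c(rep(60, 8), 50, 50)

# Base R stand-in for dplyr::lead(): shift left by one, pad with NA
next_members <- c(members[-1], NA)

# Step 1: first row where members is 1 or 2 and the following row is 0
ind <- which(members %in% c(1, 2) & next_members == 0)[1]

# Step 2: first later row whose size differs from the size at the departure row
later <- which(seq_along(size) > ind & size != size[ind])[1]

# Step 3: signed difference, stored at the departure row; all other rows stay 0
sizediff <- rep(0, length(size))
sizediff[ind] <- size[later] - size[ind]

ind       # 7 (year 1960)
later     # 9 (year 1962)
sizediff  # 0 0 0 0 0 0 -10 0 0 0
```

For team 1 the departure row is 1960 (size 60) and the next different size is 50 in 1962, which gives the -10 shown in the output above. Because of the [1], only the first departure in a team is handled; a team whose members leave more than once would need the same steps repeated for each departure row.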