R - Select the first row of each consecutive run, by group



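I have data grouped by 'ID'. Each 'ID' has different drugs on different dates. Within each consecutive run of 'drug', I want to keep only the first row. This should be done by group, i.e. within each 'ID'. Two examples are annotated in the data: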

ID        date    drug  
1  01/01/2020       A # first row in run 1 of 'A' for ID 1: keep 
1  07/01/2020       A # 2nd row in run 1 of 'A' for ID 1: drop
1  09/01/2020       B
1  15/01/2020       A
2  01/02/2020       C 
2  13/02/2020       D
2  17/02/2020       C # first row in run 2 of 'C' of ID 2: keep 
2  18/03/2020       C # 2nd row in run 2 of 'C' of ID 2: drop 
2  19/03/2020       E

Desired output:

ID     date             drug  
1      01/01/2020        A
1      09/01/2020        B
1      15/01/2020        A
2      01/02/2020        C
2      13/02/2020        D
2      17/02/2020        C
2      19/03/2020        E
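
To reproduce this, the example data could be built roughly as below. This is only a sketch: the column types are an assumption (the dates are kept as plain character strings), and `expected_rows` is a hypothetical helper that simply lists the row numbers of the desired output.

```r
# Example data from the question.
# Assumption: 'date' is stored as character (dd/mm/yyyy), not as a Date.
df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  date = c("01/01/2020", "07/01/2020", "09/01/2020", "15/01/2020",
           "01/02/2020", "13/02/2020", "17/02/2020", "18/03/2020",
           "19/03/2020"),
  drug = c("A", "A", "B", "A", "C", "D", "C", "C", "E"),
  stringsAsFactors = FALSE
)

# Rows that should survive: the first row of each run of 'drug' within an 'ID'.
expected_rows <- c(1, 3, 4, 5, 6, 7, 9)
df[expected_rows, ]
```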

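I have tried the following, but I can't get it to work: it removes drugs that reappear later within the same group. For example, it drops 15/01/2020, 17/02/2020 and 18/03/2020, because it keeps only the first observation per group (here, per combination of 'ID' and 'drug').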

df_selection <- df %>%
  group_by(ID) %>%
  arrange(ID, date) %>%
  group_by(ID, drug) %>%
  slice(1L) %>%
  arrange(ID, date)

I have tried many combinations, but I can't get it to work. I would really appreciate your help!


Another example, to demonstrate the case where the last 'drug' in one 'ID' is the same as the first 'drug' in the next 'ID' (here, drug 'B'):

ID       date drug
1 01/01/2020    A
1 07/01/2020    A
1 09/01/2020    B # first row in a run of 'B' for ID 1: keep 
1 15/01/2020    B # 2nd row in a run of 'B' for ID 1: drop 
2 01/02/2020    B # first row in a run of 'B' for ID 2: keep 
2 13/02/2020    B # 2nd: drop
2 17/02/2020    B # 3rd: drop
2 18/03/2020    E
2 19/03/2020    E

Using data.table:

setDT(df)[rowid(rleid(drug)) == 1]
#    ID       date drug
# 1:  1 01/01/2020    A
# 2:  1 09/01/2020    B
# 3:  1 15/01/2020    A
# 4:  2 01/02/2020    C
# 5:  2 13/02/2020    D
# 6:  2 17/02/2020    C
# 7:  2 19/03/2020    E

If runs of 'drug' are to be considered within each 'ID', we need…

df[rowid(rleid(ID, drug)) == 1]

…to handle the following case:

ID       date drug
1:  1 01/01/2020    A
2:  1 07/01/2020    A
3:  1 09/01/2020    B
4:  1 15/01/2020    B # This 'B' belongs to 2nd run in ID 1 
5:  2 01/02/2020    B # This 'B' belongs to 1st run in ID 2
6:  2 13/02/2020    B
7:  2 17/02/2020    B
8:  2 18/03/2020    E
9:  2 19/03/2020    E
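
The difference between the two calls can be checked on this second example. The snippet below is a sketch: the names `df2`, `by_drug` and `by_id_drug` are made up for illustration, and the 'date' column is left out on the assumption that it does not affect run detection.

```r
library(data.table)

# Second example from the question ('date' omitted for brevity).
df2 <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  drug = c("A", "A", "B", "B", "B", "B", "B", "E", "E")
)

# Runs of 'drug' alone: the 'B' run spans both IDs,
# so the first 'B' of ID 2 is not kept.
by_drug <- df2[rowid(rleid(drug)) == 1]

# Runs of the (ID, drug) pair: a change of ID always starts a new run,
# so the first 'B' of ID 2 is kept.
by_id_drug <- df2[rowid(rleid(ID, drug)) == 1]
```

Here `by_drug` keeps three rows (A, B, E), while `by_id_drug` keeps four (A and B for ID 1, B and E for ID 2).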

Using dplyr:

df %>% filter(drug != lag(drug, default = ""))

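Or, if you want to keep the first occurrence of a drug for an ID even when it matches the last drug of the previous ID (e.g. suppose the first drug of ID 2 were A, and we wanted to keep it):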

df %>%
  filter(drug != lag(drug, default = "") |
         ID != lag(ID, default = 0))
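
An alternative that may read more directly is to group by 'ID' first, so that `lag()` never looks across an ID boundary. This is a sketch of that variation, not the answer's own code; `df2` and `res` are illustrative names, and the data is the question's second example without the 'date' column.

```r
library(dplyr)

# Second example from the question ('date' omitted for brevity).
df2 <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  drug = c("A", "A", "B", "B", "B", "B", "B", "E", "E"),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  group_by(ID) %>%
  # Keep the first row of each ID, plus every row whose drug differs
  # from the previous row within the same ID.
  # On the first row lag(drug) is NA, but TRUE | NA evaluates to TRUE.
  filter(row_number() == 1 | drug != lag(drug)) %>%
  ungroup()
```

Because the comparison is made per group, no `default` sentinel is needed for either `drug` or `ID`.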

Using rle from base R:

subset(df, with(rle(drug), !duplicated(rep(seq_along(values), lengths))))

Hopefully this code works for your general case:

> subset(df, sequence(rle(drug)$lengths) == 1)
ID       date drug
1  1 01/01/2020    A
3  1 09/01/2020    B
4  1 15/01/2020    A
5  2 01/02/2020    C
6  2 13/02/2020    D
7  2 17/02/2020    C
9  2 19/03/2020    E
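
Note that `rle(drug)` on its own does not look at 'ID'. If a run can continue across an ID boundary (as in the question's second example), one possible base R sketch compares each row with the previous one on both columns. The names `df2`, `starts_run` and `res` are illustrative, and the code assumes the data has at least one row and is already sorted by 'ID' and 'date'.

```r
# Second example from the question ('date' omitted for brevity).
df2 <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  drug = c("A", "A", "B", "B", "B", "B", "B", "E", "E"),
  stringsAsFactors = FALSE
)

n <- nrow(df2)

# A row starts a new run if it is the very first row, or if its ID or its
# drug differs from the row directly above it.
starts_run <- c(TRUE,
                df2$ID[-1] != df2$ID[-n] | df2$drug[-1] != df2$drug[-n])

res <- df2[starts_run, ]
```

This keeps rows 1, 3, 5 and 8, i.e. the first 'B' of ID 2 is retained even though ID 1 ends with 'B'.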
