Filter a dataset according to the age-range probabilities defined by another dataset



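I have two datasets. The "unexposed" dataset is much larger than the "exposed" one. I want to filter the unexposed dataset so that its Birth_Year distribution matches that of the exposed dataset.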

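For example, if the exposed dataset has 1 person born in 2000, I want 1 person born in 2000 in the unexposed dataset. If the exposed dataset has 3 people born in 2001, I want 3 people born in 2001 in the unexposed dataset.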

Here is my exposed dataset:

> dput(exposed)
structure(list(Status = c("Exposed", "Exposed", "Exposed", "Exposed", 
"Exposed", "Exposed", "Exposed", "Exposed", "Exposed", "Exposed", 
"Exposed"), Birth_Year = c("2001", "2001", "2000", "2002", "2001", 
"2005", "2004", "2006", "2006", "1999", "2007")), row.names = c(NA, 
-11L), class = "data.frame")

Here is my unexposed dataset:

> dput(unexposed)
structure(list(Status = c("Unexposed", "Unexposed", "Unexposed", 
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed", 
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed", 
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed", 
"Unexposed", "Unexposed"), Birth_Year = c(1997L, 2001L, 2003L, 
2003L, 2003L, 2003L, 1999L, 2003L, 2003L, 2003L, 2001L, 2003L, 
2003L, 2001L, 2004L, 2004L, 2005L, 2000L, 2005L, 2003L)), class = "data.frame", row.names = c(NA, 
-20L))
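
To make the target concrete, the per-year counts of both data frames can be tabulated side by side. This is a minimal base R sketch, not part of the original post: the data is re-entered in compact form so the snippet runs on its own, and the names `years`, `counts`, `n_exposed`, `n_unexposed` and `n_matchable` are made up for illustration.

```r
# Compact re-entry of the two data frames from the dput() output above
exposed <- data.frame(
  Status = "Exposed",
  Birth_Year = c("2001", "2001", "2000", "2002", "2001", "2005",
                 "2004", "2006", "2006", "1999", "2007"),
  stringsAsFactors = FALSE
)
unexposed <- data.frame(
  Status = "Unexposed",
  Birth_Year = c(1997L, 2001L, 2003L, 2003L, 2003L, 2003L, 1999L, 2003L,
                 2003L, 2003L, 2001L, 2003L, 2003L, 2001L, 2004L, 2004L,
                 2005L, 2000L, 2005L, 2003L),
  stringsAsFactors = FALSE
)

# Birth_Year is character in exposed and integer in unexposed,
# so bring both to integer before comparing
exposed$Birth_Year <- as.integer(exposed$Birth_Year)

# Tabulate both frames over a common set of years so the counts line up
years <- sort(union(exposed$Birth_Year, unexposed$Birth_Year))
counts <- data.frame(
  Birth_Year  = years,
  n_exposed   = as.vector(table(factor(exposed$Birth_Year, levels = years))),
  n_unexposed = as.vector(table(factor(unexposed$Birth_Year, levels = years)))
)

# Number of unexposed rows that can actually be matched in each year
counts$n_matchable <- pmin(counts$n_exposed, counts$n_unexposed)
print(counts)
```

Years such as 2002, 2006 and 2007 occur only in the exposed data, so they cannot be matched at all. With this sample data, any exact-count filter can return at most `sum(counts$n_matchable)` rows.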

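Here is an attempt using dplyr/tidyverse. The idea is to number the rows within each Birth_Year group in both data frames, and then do an inner_join on Birth_Year and that row number. This keeps, for each year, as many unexposed rows as there are exposed rows. Because Birth_Year is character in exposed and integer in unexposed, it is converted to numeric before the join. Years that are missing from unexposed, or that have too few rows there, yield fewer matches.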

library(tidyverse)

unexposed %>% 
  group_by(Birth_Year) %>%  
  mutate(rn = row_number()) %>%    # number the rows within each birth year
  inner_join(
    exposed %>%  
      mutate(Birth_Year = as.numeric(Birth_Year)) %>%   # character -> numeric
      group_by(Birth_Year) %>%  
      mutate(rn = row_number()),
    by = c("Birth_Year", "rn"),    # match the k-th row of a year to the k-th row
    suffix = c("", ".y")) %>% 
  arrange(Birth_Year) %>% 
  select(Status, Birth_Year)
# A tibble: 7 × 2
# Groups:   Birth_Year [5]
  Status    Birth_Year
  <chr>          <dbl>
1 Unexposed       1999
2 Unexposed       2000
3 Unexposed       2001
4 Unexposed       2001
5 Unexposed       2001
6 Unexposed       2004
7 Unexposed       2005
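
The join above always keeps the first rows of each birth year in unexposed. If a random selection within each year is preferred, one possible variant is sketched below in base R. This is an illustration under assumptions, not the answer's method: the seed is arbitrary, the data is re-entered so the snippet runs on its own, and the names `exposed_year`, `picked` and `matched` are invented for this example.

```r
# Compact re-entry of the two data frames from the question
exposed <- data.frame(
  Status = "Exposed",
  Birth_Year = c("2001", "2001", "2000", "2002", "2001", "2005",
                 "2004", "2006", "2006", "1999", "2007"),
  stringsAsFactors = FALSE
)
unexposed <- data.frame(
  Status = "Unexposed",
  Birth_Year = c(1997L, 2001L, 2003L, 2003L, 2003L, 2003L, 1999L, 2003L,
                 2003L, 2003L, 2001L, 2003L, 2003L, 2001L, 2004L, 2004L,
                 2005L, 2000L, 2005L, 2003L),
  stringsAsFactors = FALSE
)

set.seed(1)  # arbitrary seed, only so that the draw is reproducible

exposed_year <- as.integer(exposed$Birth_Year)

# For every birth year present in exposed, draw as many unexposed row
# indices as there are exposed rows, capped at what unexposed offers
picked <- unlist(lapply(unique(exposed_year), function(yr) {
  idx <- which(unexposed$Birth_Year == yr)
  n <- min(length(idx), sum(exposed_year == yr))
  if (n == 0) return(integer(0))   # year absent from unexposed
  idx[sample.int(length(idx), n)]  # n distinct positions, without replacement
}))

matched <- unexposed[picked, ]
matched <- matched[order(matched$Birth_Year), ]
print(matched)
```

The per-year counts are the same as in the join-based result, so this again returns 7 rows for the sample data. Only the choice of which 2004 and 2005 rows are kept depends on the seed.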
