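I have two datasets. The "unexposed" dataset is much larger than the "exposed" one. I want to filter the "unexposed" dataset so that its Birth_Year distribution matches that of the "exposed" dataset.
For example, if the exposed dataset has 1 person born in 2000, I want 1 person born in 2000 in the unexposed dataset. If the exposed dataset has 3 people born in 2001, I want 3 people born in 2001 in the unexposed dataset.
Here is my exposed dataset: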
> dput(exposed)
structure(list(Status = c("Exposed", "Exposed", "Exposed", "Exposed",
"Exposed", "Exposed", "Exposed", "Exposed", "Exposed", "Exposed",
"Exposed"), Birth_Year = c("2001", "2001", "2000", "2002", "2001",
"2005", "2004", "2006", "2006", "1999", "2007")), row.names = c(NA,
-11L), class = "data.frame")
Here is my unexposed dataset:
> dput(unexposed)
structure(list(Status = c("Unexposed", "Unexposed", "Unexposed",
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed",
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed",
"Unexposed", "Unexposed", "Unexposed", "Unexposed", "Unexposed",
"Unexposed", "Unexposed"), Birth_Year = c(1997L, 2001L, 2003L,
2003L, 2003L, 2003L, 1999L, 2003L, 2003L, 2003L, 2001L, 2003L,
2003L, 2001L, 2004L, 2004L, 2005L, 2000L, 2005L, 2003L)), class = "data.frame", row.names = c(NA,
-20L))
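Before filtering, it can help to tabulate how many rows per birth year are wanted and how many are available. The following is only a base R sketch: the names `years`, `wanted`, `available`, `matchable` and `year_counts` are illustrative and not part of the original post, and the data frames are rebuilt in compact form with the same values as the dput output above.

```r
# Rebuild the two example data frames (same values as the dput output).
exposed <- data.frame(
  Status = "Exposed",
  Birth_Year = c("2001", "2001", "2000", "2002", "2001", "2005",
                 "2004", "2006", "2006", "1999", "2007")
)
unexposed <- data.frame(
  Status = "Unexposed",
  Birth_Year = c(1997L, 2001L, 2003L, 2003L, 2003L, 2003L, 1999L, 2003L,
                 2003L, 2003L, 2001L, 2003L, 2003L, 2001L, 2004L, 2004L,
                 2005L, 2000L, 2005L, 2003L)
)

# Birth_Year is character in `exposed` and integer in `unexposed`,
# so convert to a common type before comparing.
years <- sort(unique(c(as.integer(exposed$Birth_Year), unexposed$Birth_Year)))

# Tabulate both data sets over the same set of years (absent years count as 0).
wanted    <- table(factor(as.integer(exposed$Birth_Year), levels = years))
available <- table(factor(unexposed$Birth_Year, levels = years))

year_counts <- data.frame(
  Birth_Year = years,
  wanted     = as.integer(wanted),
  available  = as.integer(available)
)

# The most rows that can actually be matched for each year.
year_counts$matchable <- pmin(year_counts$wanted, year_counts$available)
print(year_counts)
```

For this sample, 2002, 2006 and 2007 have no unexposed rows at all, so an exact match of all 11 exposed rows is not possible with the data shown.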
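Here is an attempt to solve this with dplyr/tidyverse. The idea is to number the rows within each Birth_Year group in both data frames, then do an inner_join on Birth_Year and that row number, so that unexposed keeps at most as many rows per year as exposed has. Birth_Year is stored as character in exposed, so it is converted to numeric before the join. Years without enough unexposed rows (2002, 2006 and 2007 here) cannot be matched, which is why the result has 7 rows rather than 11.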
library(tidyverse)

unexposed %>%
  group_by(Birth_Year) %>%
  mutate(rn = row_number()) %>%   # number rows within each birth year
  inner_join(
    exposed %>%
      mutate(Birth_Year = as.numeric(Birth_Year)) %>%   # character in exposed
      group_by(Birth_Year) %>%
      mutate(rn = row_number()),
    by = c("Birth_Year", "rn"),
    suffix = c("", ".y")          # Status from unexposed keeps its name
  ) %>%
  arrange(Birth_Year) %>%
  select(Status, Birth_Year)
# A tibble: 7 × 2
# Groups:   Birth_Year [5]
  Status    Birth_Year
  <chr>          <dbl>
1 Unexposed       1999
2 Unexposed       2000
3 Unexposed       2001
4 Unexposed       2001
5 Unexposed       2001
6 Unexposed       2004
7 Unexposed       2005
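Numbering rows with row_number() always keeps the first rows of each year in their original order. If a random pick within each year is preferred, one option is to shuffle unexposed before numbering. The following is a minimal base R sketch of the same numbering-and-join idea, not the code from the answer above: the names `shuffled`, `exposed_keys` and `matched` are illustrative, the seed is arbitrary, and `ave()` with `seq_along` stands in for `row_number()`.

```r
# Rebuild the two example data frames (same values as the dput output).
exposed <- data.frame(
  Status = "Exposed",
  Birth_Year = c("2001", "2001", "2000", "2002", "2001", "2005",
                 "2004", "2006", "2006", "1999", "2007")
)
unexposed <- data.frame(
  Status = "Unexposed",
  Birth_Year = c(1997L, 2001L, 2003L, 2003L, 2003L, 2003L, 1999L, 2003L,
                 2003L, 2003L, 2001L, 2003L, 2003L, 2001L, 2004L, 2004L,
                 2005L, 2000L, 2005L, 2003L)
)

set.seed(1)  # arbitrary seed, only to make the shuffle reproducible

# Shuffle the unexposed rows so that "first n per year" becomes a random pick.
shuffled <- unexposed[sample(nrow(unexposed)), , drop = FALSE]

# Number each row within its birth-year group.
shuffled$rn <- ave(seq_len(nrow(shuffled)), shuffled$Birth_Year,
                   FUN = seq_along)

# Same numbering for the exposed side, with Birth_Year converted to integer.
exposed_keys <- data.frame(Birth_Year = as.integer(exposed$Birth_Year))
exposed_keys$rn <- ave(seq_len(nrow(exposed_keys)), exposed_keys$Birth_Year,
                       FUN = seq_along)

# merge() performs an inner join by default; only (year, rn) pairs present
# on both sides survive.
matched <- merge(shuffled, exposed_keys, by = c("Birth_Year", "rn"))
matched <- matched[order(matched$Birth_Year), c("Status", "Birth_Year")]

print(table(matched$Birth_Year))
```

With this sample data the per-year counts are the same as in the dplyr result (7 rows over 5 years); only which unexposed rows are chosen can differ, and here that is not visible because the rows within a year are identical.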