R - subsetting original data frame: N random observations, 50% of N has ethnicity E and 50% of N has education E

Hi Stack Overflow users,

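I am new to R and have been learning it for only a few weeks. I have a data frame with 15 string variables describing people's characteristics (e.g., ethnicity, education, country of origin); one row is one person.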

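How can I tell R to create a subset of the original data frame so that the new data frame contains N random people (drawn with replacement), where 50% of N have ethnicity ET and 50% of N have education ED? I know the basics, A) and B):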

A) I know how to randomly draw N observations with replacement, as suggested in other posts. For example:

df[sample(nrow(df), size=N, replace=TRUE), ]

B) Another post has an example of how to condition the random draw (without replacement):

df[sample(which(df$Ethnicity == "ET" | df$Education == "ED"), N), ]

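However, I would like to know how to do a more complex conditional draw, where 50% of N must have ethnicity ET and 50% of N must have education ED. In this new sample of size N, the two conditions therefore only partly intersect: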

For some people Ethnicity == ET & Education == ED, for some Ethnicity != ET & Education == ED, for some Ethnicity == ET & Education != ED, and for some Ethnicity != ET & Education != ED.

A simple solution is to sample 1/4 of N from each combination, provided every combination exists in the data:

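n  <- 1e2 / 4  # 25 rows per combination, 100 rows in total
# Draw row indices with replacement from each of the four et/ed combinations
y <- x[c(sample(which(x$et & x$ed), n, TRUE)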
, sample(which(!x$et & x$ed), n, TRUE)
, sample(which(x$et & !x$ed), n, TRUE)
, sample(which(!x$et & !x$ed), n, TRUE)),]
table(y)
#       ed
#et      FALSE TRUE
#  FALSE    25   25
#  TRUE     25   25
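The equal split is one member of a family of valid draws. Both 50% margins hold whenever the number of rows with both conditions equals the number with neither, and the two "exactly one condition" groups are the same size. The sketch below exposes that single free choice as `k`, the number of rows satisfying both conditions. The helper name `draw_half_half` and its arguments are hypothetical, not part of the answer above. It indexes with `sample.int()` because `sample(idx, ...)` treats a length-one `idx` as the range `1:idx`.

```r
set.seed(7)
x <- data.frame(
  et = sample(c(TRUE, FALSE), 1e4, TRUE, c(.25, .75)),
  ed = sample(c(TRUE, FALSE), 1e4, TRUE, c(.75, .25))
)

# Hypothetical helper: draw n rows with replacement so that exactly half
# have et and exactly half have ed; k rows satisfy both conditions.
draw_half_half <- function(x, n, k) {
  stopifnot(n %% 2 == 0, k == round(k), k >= 0, k <= n / 2)
  m <- n / 2 - k  # size of each "exactly one condition" group
  pick <- function(idx, size) {
    if (size == 0) return(integer(0))
    if (length(idx) == 0) stop("requested combination is absent from the data")
    # sample.int avoids sample()'s special case for a single number
    idx[sample.int(length(idx), size, replace = TRUE)]
  }
  i <- c(pick(which( x$et &  x$ed), k),   # both conditions
         pick(which(!x$et & !x$ed), k),   # neither condition
         pick(which( x$et & !x$ed), m),   # et only
         pick(which(!x$et &  x$ed), m))   # ed only
  x[i, ]
}

y <- draw_half_half(x, n = 100, k = 10)
mean(y$et)  # share of rows with et
mean(y$ed)  # share of rows with ed
```

With `k = n / 4` this reduces to the equal split shown above.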

If a combination does not exist, you can use table to get the proportion for each combination, like this:

n  <- 1e2
x  <- x[!x$et | x$ed,]  # drop the et & !ed rows to simulate a missing combination
tt  <- table(x)
tt  <- tt * t(tt)  # an empty cell also zeroes its mirror cell
tt <- tt / rowSums(tt)
tt <- tt / rep(colSums(tt), each=2)
tt <- round(proportions(tt)*n) # proportions() is the newer name for prop.table()
#tt <- round(prop.table(tt)*n) # same call for older R; after rounding the counts may not sum to n
y <- x[c(sample(which(!x$et & !x$ed), tt[1], TRUE)
, sample(which(x$et & !x$ed), tt[2], TRUE)
, sample(which(!x$et & x$ed), tt[3], TRUE)
, sample(which(x$et & x$ed), tt[4], TRUE)),]
table(y)
#       ed
#et      FALSE TRUE
#  FALSE    50    0
#  TRUE      0   50
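The 50/0/0/50 result can also be reasoned out directly. If `k` rows satisfy both conditions, the 50% margins force `k` rows with neither condition and `n / 2 - k` rows in each "exactly one condition" group. An empty "exactly one" group therefore forces `k = n / 2`, and an empty "both" or "neither" group forces `k = 0`. The sketch below computes the feasible range of `k` from the data; the names `has`, `k_min`, and `k_max` are illustrative assumptions, not part of the answer above.

```r
set.seed(7)
x <- data.frame(
  et = sample(c(TRUE, FALSE), 1e4, TRUE, c(.25, .75)),
  ed = sample(c(TRUE, FALSE), 1e4, TRUE, c(.75, .25))
)
x <- x[!x$et | x$ed, ]  # same filter as above: no et & !ed rows remain

n <- 100

# TRUE if at least one row has the given et/ed combination
has <- function(et, ed) any(x$et == et & x$ed == ed)

# Both "exactly one condition" groups must exist unless their size is zero
k_min <- if (has(TRUE, FALSE) && has(FALSE, TRUE)) 0 else n / 2
# The "both" and "neither" groups must exist unless k is zero
k_max <- if (has(TRUE, TRUE) && has(FALSE, FALSE)) n / 2 else 0

if (k_min > k_max) stop("no draw can satisfy both 50% margins")
c(k_min = k_min, k_max = k_max)
```

Here both bounds equal `n / 2`, which matches the table above: 50 rows with both conditions and 50 with neither.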

Data:

set.seed(7)
n  <- 1e4
x  <- data.frame(et=sample(c(TRUE,FALSE), n, TRUE, c(.25,.75)), ed=sample(c(TRUE,FALSE), n, TRUE, c(.75,.25)))
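The example data uses logical columns, while the data frame in the question holds strings. One way to bridge the two, sketched here under assumptions, is to derive logical flags first and then sample by combination. The column names `Ethnicity` and `Education` and the values `"ET"` and `"ED"` come from the question; the toy data frame and the value `"other"` are invented for illustration.

```r
set.seed(1)
# Toy stand-in for the question's data frame (invented values)
df <- data.frame(
  Ethnicity = sample(c("ET", "other"), 200, replace = TRUE),
  Education = sample(c("ED", "other"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Logical flags for the two conditions
is_et <- df$Ethnicity == "ET"
is_ed <- df$Education == "ED"

# Inspect how many rows fall into each combination before sampling
table(is_et, is_ed)

n <- 100 / 4  # rows per combination
idx <- c(sample(which( is_et &  is_ed), n, TRUE),
         sample(which(!is_et &  is_ed), n, TRUE),
         sample(which( is_et & !is_ed), n, TRUE),
         sample(which(!is_et & !is_ed), n, TRUE))
y <- df[idx, ]

mean(y$Ethnicity == "ET")  # share with ethnicity ET
mean(y$Education == "ED")  # share with education ED
```

Note that `which()` drops `NA`, so rows with a missing ethnicity or education are never drawn in this sketch.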
