I'm working with very messy family data, because a kid can be grouped into more than one family. The data is structured like this:
famid <- c("A","A","B","C","C","D","D")
kidid <- c("1","2","1","3","4","4","5")
df <- as.data.frame(cbind(famid, kidid))
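I want to identify which families I can drop, based on the criterion that all of the kids in that family are also grouped together in another, larger family.
For example, Family A contains Kid 1 and Kid 2. Family B contains Kid 1. Because Family B is fully contained within Family A, I want to drop Family B.
Alternatively, Family C contains Kid 3 and Kid 4. Family D contains Kid 4 and Kid 5. Neither family is fully contained within the other, so I don't want to drop either of them for now.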
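The containment criterion in the two examples above can be checked for a single pair of families with a subset test. This is only a minimal sketch on the sample data from the question; the helper name is_contained is hypothetical and not part of the original post.

```r
# Sample data from the question
famid <- c("A", "A", "B", "C", "C", "D", "D")
kidid <- c("1", "2", "1", "3", "4", "4", "5")
df <- data.frame(famid, kidid, stringsAsFactors = FALSE)

# One character vector of kids per family, named by famid
kids <- split(df$kidid, df$famid)

# Hypothetical helper: TRUE if every kid of family `small` is also in family `big`
is_contained <- function(small, big) all(kids[[small]] %in% kids[[big]])

is_contained("B", "A")  # TRUE: Family B could be dropped
is_contained("C", "D")  # FALSE
is_contained("D", "C")  # FALSE: neither C nor D is dropped
```

This only tests one pair at a time; deciding which pairs to test is the harder part of the problem.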
In my data, each kid can belong to up to 6 families, and each family can have up to 8 kids. There are thousands of families and thousands of kids.
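I tried to solve this by building a very wide data.frame with one row per student, columns for each family the kid is associated with and for each sibling in each of those families, and an additional column (sibgrp) per associated family that concatenates all of the siblings together. But when I tried to search the concatenated string for individual siblings, I found I didn't know how to do it: grepl doesn't take a vector as the pattern argument.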
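One way around the single-pattern limitation of grepl is to call it once per sibling. The sketch below assumes the siblings were pasted together with "_" as a separator; the separator and the sample values are assumptions, since the question does not show the sibgrp column.

```r
# Hypothetical concatenated sibling string and siblings to look for
sibgrp <- "1_2_14"
sibs <- c("1", "4", "14")

# Call grepl once per pattern; fixed = TRUE treats each pattern as a literal string
hits <- vapply(sibs, grepl, logical(1), x = sibgrp, fixed = TRUE)

# Substring matching is fragile: "4" is found inside "14".
# Splitting the string back into ids and using %in% avoids that.
in_group <- sibs %in% strsplit(sibgrp, "_", fixed = TRUE)[[1]]
```

Here hits reports "4" as present even though only kid 14 is in the group, while in_group does not, which suggests that keeping the siblings as vectors rather than concatenated strings is the safer representation.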
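I then started looking at intersect and similar functions, but those compare whole vectors with each other rather than comparing observations within a vector with other observations in the same vector. (Meaning: I can't look for the intersection between the strings df[1,2] and df[1,3]. intersect identifies the intersection between df[2] and df[3].)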
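If the sibling groups are stored as lists of vectors instead of data.frame columns, intersect can be applied element by element with Map. This is a sketch under that assumption; own_kids and other_kids are made-up names and values, not columns from the question.

```r
# Hypothetical sibling groups: element i of each list belongs to the same row
own_kids   <- list(c("1", "2"), c("1"), c("3", "4"))
other_kids <- list(c("1"), c("1", "2"), c("4", "5"))

# Pairwise intersection: one result per row, not one for the whole column
shared <- Map(intersect, own_kids, other_kids)

# Row-wise containment test: are all of own_kids[[i]] in other_kids[[i]]?
fully_contained <- mapply(function(a, b) all(a %in% b), own_kids, other_kids)
```

With these values, shared holds "1", "1" and "4", and only the second row is fully contained in its counterpart.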
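I tried to bend my thinking to fit that approach, so that I could compare vectors of siblings, assuming I already know that at least one sibling is shared. Given how many different families there are, and how many of them don't share even a single kid, I don't even know how to begin doing that.
What am I missing here? I would really appreciate any feedback. Thank you very much.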
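This function can also be used to do the task. It returns a character vector containing the names of the families that can be removed.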
test_function <- function(dataset){
  ## split kidid by famid: one vector of kids per family
  kids_family <- split.default(dataset[['kidid']], f = dataset[['famid']])
  family <- names(kids_family)
  ## all possible pairs of families, one pair per column
  combn_family <- combn(family, 2)
  family_removed <- character(0)
  apply(combn_family, MARGIN = 2, function(x){
    ## an empty setdiff means every kid of the first family is also in the second
    if (length(setdiff(kids_family[[x[1]]], kids_family[[x[2]]])) == 0)
      family_removed <<- c(family_removed, x[1])
    else if (length(setdiff(kids_family[[x[2]]], kids_family[[x[1]]])) == 0)
      family_removed <<- c(family_removed, x[2])
  })
  ## a family contained in several others is reported only once
  return (unique(family_removed))
}
> df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
+ kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
> test_function(df)
[1] "B" "F"
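The returned vector can then be used to filter the original data. A minimal base R sketch, taking the result shown above as a literal so that the snippet runs on its own; the names to_drop and df_kept are made up for illustration.

```r
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))

# Families reported as removable in the output above
to_drop <- c("B", "F")

# Keep only the rows whose family is not in to_drop
df_kept <- df[!(df$famid %in% to_drop), ]
```

With the sample data this keeps the 9 rows that belong to families A, C, D and E.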
I tried with setdiff, but no luck. I came up with this laborious solution, and I hope there is a better way.
# dependencies for melting tables and handling data.frames
library(reshape2)
library(dplyr)
# I have added two more cases to your data.frame
# kidid is passed as numeric (with quotes it would have been changed to factor by default)
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
# let's have a look at it
df
#    famid kidid
# 1      A     1
# 2      A     2
# 3      B     1
# 4      C     3
# 5      C     4
# 6      D     4
# 7      D     5
# 8      E     7
# 9      E     8
# 10     E     9
# 11     F     7
# 12     F     9
# we build a contingency table
m <- table(df$famid, df$kidid)
# a family A only contains a family B if A has all the elements of B,
# and at least one that B doesn't have
m
#     1 2 3 4 5 7 8 9
#   A 1 1 0 0 0 0 0 0
#   B 1 0 0 0 0 0 0 0
#   C 0 0 1 1 0 0 0 0
#   D 0 0 0 1 1 0 0 0
#   E 0 0 0 0 0 1 1 1
#   F 0 0 0 0 0 1 0 1
# a helper function to implement that and return a friendly data.frame
family_contained <- function(m){
  res <- list()
  for (i in seq_len(nrow(m))) {
    # for each line in m, we calculate the difference to every line of m
    # (itself included, so that the row labels stay aligned with the families)
    res[[i]] <- t(apply(m, 1, function(row) m[i, ] - row))
  }
  # here we test if all values are 0+ (ie if the selected family has all elements of the other)
  # and if at least one is >=1 (ie if the selected family has at least one element that the other doesn't have);
  # a family compared with itself gives all zeros, so it is never flagged
  tab <- sapply(res, function(d) apply(d, 1, function(x) all(x >= 0) & any(x >= 1)))
  # rows are the contained families, columns the containing ones
  dimnames(tab) <- list(rownames(m), rownames(m))
  # we format it as a table to have nice names
  tab %>% as.table() %>%
    # we melt it into a data.frame
    melt() %>%
    # only select TRUE and get rid of this column
    filter(value) %>% select(-value) %>%
    # to make things clear we name columns
    `colnames<-`(c("this_family_is_contained", "this_family_contains"))
}
family_contained(m)
# this_family_is_contained this_family_contains
# 1 B A
# 2 F E
# finally you can filter them with
filter(df, !(famid %in% family_contained(m)$this_family_is_contained))
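The same contingency-table idea can also be written with a matrix product instead of a loop. This is only a sketch of an alternative, not the method used above: tcrossprod counts the kids each pair of families shares, and a family is flagged when all of its kids are shared with a strictly larger family. The names shared, size, contained and to_drop are made up for illustration.

```r
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))

# 0/1 incidence matrix: one row per family, one column per kid
m <- unclass(table(df$famid, df$kidid))

# shared[i, j] = number of kids that families i and j have in common
shared <- tcrossprod(m)
# the diagonal holds the number of kids in each family
size <- diag(shared)

# family i is contained in family j when all of i's kids are shared with j
# and j is strictly larger (this also excludes the diagonal)
contained <- shared == size & outer(size, size, "<")

# families that are contained in at least one other family
to_drop <- rownames(contained)[rowSums(contained) > 0]
to_drop

# rows of the families that are kept
df[!(df$famid %in% to_drop), ]
```

On the sample data to_drop is "B" and "F", matching the result above. Note that this builds a dense families-by-kids matrix, so with thousands of families and kids memory use may become a concern.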