所以假设我有一个如下的df:
DF1
ID Name Age Color
001 Stephanie 23 Red
002 DJ
003 Michelle 20 Blue
004 Danny 40 Green
005 Joey
006 Jesse
005 Jesse
让我们假设数据是1000+obs和30+vars
假设我对所有缺少信息的变量进行了子集
DF2(DF1的子集(
ID Name Age Color
002 DJ
005 Joey
006 Jesse
005 Jesse
在这份表格中,我找到了一些信息。
静止DF2
ID Name Age Color
002 DJ
005 Joey 42 Yellow
006 Jesse 39 Black
005 Jesse
我如何更新原始数据中的行,以获得我只能作为子集更新的所有信息。最终df应为:
ID Name Age Color
001 Stephanie 23 Red
002 DJ
003 Michelle 20 Blue
004 Danny 40 Green
005 Joey 42 Yellow
006 Jesse 39 Black
005 Jesse
添加一个new_id
列以在上明显匹配如何
#recreating dfs
df1 <- data.frame(ID = c(001, 002, 003, 004, 005, 006, 005),
Name = c("Stephanie", "DJ", "Michelle", "Danny", "Joey", "Jesse", "Jesse"),
Age = c(23,"", 20, 40, "", "", ""),
Color = c("Red", "", "Blue", "Green", "","",""), stringsAsFactors = FALSE)
new_id <- 1:nrow(df1)
df1 <- cbind(new_id, df1)
df2 <- data.frame(new_id = c(2,5,6,7),
ID = c(002, 005, 006, 005),
Name = c("DJ","Joey", "Jesse", "Jesse"),
Age = c("",42, 39, ""),
Color = c("", "Yellow","Black",""), stringsAsFactors = FALSE)
然后使用它作为索引来指定要替换的行:
#add rows back in
for (i in 1:nrow(df1)){
if (df1$new_id[i] %in% df2$new_id){
df1[i,] <- df2[which(df2$new_id == df1$new_id[i]),]
}
}
> df1
new_id ID Name Age Color
1 1 1 Stephanie 23 Red
2 2 2 DJ
3 3 3 Michelle 20 Blue
4 4 4 Danny 40 Green
5 5 5 Joey 42 Yellow
6 6 6 Jesse 39 Black
7 7 5 Jesse
或者,dplyr
解决方案(使用相同的df(:
> df2 %>% union(df1) %>% distinct(ID, Name, .keep_all = T) %>% arrange(new_id)
new_id ID Name Age Color
1 1 1 Stephanie 23 Red
2 2 2 DJ
3 3 3 Michelle 20 Blue
4 4 4 Danny 40 Green
5 5 5 Joey 42 Yellow
6 6 6 Jesse 39 Black
7 7 5 Jesse