r - Why am I not getting the full complement of a data set?



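I have 2 data sets, one of which is a subset of the other. I am trying to find the complement of the smaller data set within the larger one, that is, a data set containing every row that is in the larger one but not in the smaller one. I tried: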

df3 <- setdiff(df1, df2)

But it does not give me the full complement data set:

nrow(df3) + nrow(df2) != nrow(df1)

What is the problem? I cannot post my data sets because they are too large, but here is their str:

df2
'data.frame':   8185 obs. of  17 variables:
$ SAMPN    : Factor w/ 1867 levels "    4","    5",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ PERNO    : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ PLANO    : Factor w/ 28 levels " 2"," 3"," 4",..: 1 2 3 4 5 6 1 2 3 4 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ TPURP    : Factor w/ 22 levels "(1) Working at home (for pay)",..: 16 14 4 5 9 12 9 5 3 5 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ loop     : Factor w/ 8 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ ARR_MIN  : Factor w/ 60 levels " 0"," 1"," 2",..: 25 21 11 31 31 51 22 53 11 56 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ ARR_HR   : Factor w/ 24 levels " 1"," 2"," 3",..: 9 18 19 19 20 20 12 12 13 13 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ start_hr : Factor w/ 24 levels " 1"," 2"," 3",..: 8 18 19 19 20 20 12 12 13 13 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ start_min: Factor w/ 60 levels " 0"," 1"," 2",..: 35 6 6 26 1 41 19 29 1 46 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ walk     : chr  "-1.00383132952532" "-0.926581782419858" "-1.02631368170796" "-0.932791692585498" ...
$ car      : chr  "2.07437681481379" "1.14501550876385" "1.11864841001179" "0.989597814702681" ...
$ bus      : chr  "-0.766918118637934" "-0.955021318273173" "-0.936196906716972" "-0.995116987781044" ...
$ MODE1    : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ utipassen: Factor w/ 11665 levels "-0.00013173196102555",..: 1439 10982 10259 11235 9871 5775 5387 9953 6000 10399 ...
..- attr(*, "names")= chr  NA "24" "25" "26" ...
$ HHVEH    : Factor w/ 9 levels "0","1","2","3",..: 3 3 3 3 3 3 3 3 3 3 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ VEHLIC   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ licence2 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...

DF1

'data.frame':   14693 obs. of  17 variables:
$ SAMPN    : Factor w/ 1867 levels "    4","    5",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ PERNO    : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ PLANO    : Factor w/ 28 levels " 2"," 3"," 4",..: 1 2 3 4 5 6 1 2 3 4 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ TPURP    : Factor w/ 22 levels "(1) Working at home (for pay)",..: 16 14 4 5 9 12 9 5 3 5 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ loop     : Factor w/ 8 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ ARR_MIN  : Factor w/ 60 levels " 0"," 1"," 2",..: 25 21 11 31 31 51 22 53 11 56 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ ARR_HR   : Factor w/ 24 levels " 1"," 2"," 3",..: 9 18 19 19 20 20 12 12 13 13 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ start_hr : Factor w/ 24 levels " 1"," 2"," 3",..: 8 18 19 19 20 20 12 12 13 13 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ start_min: Factor w/ 60 levels " 0"," 1"," 2",..: 35 6 6 26 1 41 19 29 1 46 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ walk     : Factor w/ 11665 levels "-0.000581433567566935",..: 5607 3104 6055 3192 1894 7541 9111 637 8958 8634 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ car      : Factor w/ 11665 levels "-0.00234049683698745",..: 11335 7668 7255 4911 8856 5412 4359 8146 6061 5818 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ bus      : Factor w/ 11665 levels "-0.00101509639366457",..: 4839 7258 6826 8249 588 2755 3725 720 2918 2526 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ MODE1    : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ utipassen: Factor w/ 11665 levels "-0.00013173196102555",..: 2135 9762 7576 10524 6412 8409 7819 6659 8758 7961 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ HHVEH    : Factor w/ 9 levels "0","1","2","3",..: 3 3 3 3 3 3 3 3 3 3 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ VEHLIC   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...
$ licence2 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr  "23" "24" "25" "26" ...

Head of the data:

DF2:

structure(list(SAMPN = c("    4", "    4", "    4", "    4", 
"    4", "    4"), PERNO = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Names = c(NA, "24", "25", "26", "27", NA), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), PLANO = structure(1:6, .Names = c(NA, 
"24", "25", "26", "27", NA), .Label = c(" 2", " 3", " 4", " 5", 
" 6", " 7", " 8", " 9", "10", "11", "12", "13", "14", "15", "16", 
"17", "29", "18", "19", "20", "21", "22", "23", "24", "25", "26", 
"27", "28"), class = "factor"), TPURP = structure(c(16L, 14L, 
4L, 5L, 9L, 12L), .Names = c(NA, "24", "25", "26", "27", NA), .Label = c("(1) Working at home (for pay)", 
"(10) Other, specify - transportation", "(11) Work/Business related", 
"(12) Service Private Vehicle", "(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15) Household errands", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(2) All other home activities", 
"(20) Recreation/Entertainment", "(21) Visit friends/relative", 
"(24) Loop trip", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger"), class = "factor"), loop = structure(c(2L, 
2L, 2L, 2L, 2L, 2L), .Names = c(NA, "24", "25", "26", "27", NA
), .Label = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "factor")), row.names = c(NA, 
6L), class = "data.frame")

DF1:

structure(list(SAMPN = c("    4", "    4", "    4", "    4", 
"    4", "    4"), PERNO = structure(c(`23` = 1L, `24` = 1L, 
`25` = 1L, `26` = 1L, `27` = 1L, `28` = 1L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), PLANO = structure(1:6, .Names = c("23", 
"24", "25", "26", "27", "28"), .Label = c(" 2", " 3", " 4", " 5", 
" 6", " 7", " 8", " 9", "10", "11", "12", "13", "14", "15", "16", 
"17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
"28", "29"), class = "factor"), TPURP = structure(c(`23` = 16L, 
`24` = 14L, `25` = 4L, `26` = 5L, `27` = 9L, `28` = 12L), .Label = c("(1) Working at home (for pay)", 
"(10) Other, specify - transportation", "(11) Work/Business related", 
"(12) Service Private Vehicle", "(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15)Household erran ds", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(2) All other home activities", 
"(20) Recreation/Entertainment", "(21) Visit friends/relative", 
"(24) Loop trip", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger"), class = "factor"), loop = structure(c(`23` = 2L, 
`24` = 2L, `25` = 2L, `26` = 2L, `27` = 2L, `28` = 2L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8"), class = "factor")), row.names = c("23", 
"24", "25", "26", "27", "28"), class = "data.frame")

According to ?setdiff (from dplyr):

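These functions override the set functions provided in base to make them generic, so that efficient versions for data frames and other tables can be provided. The default methods call the base versions. Beware that intersect(), union() and setdiff() remove duplicates.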
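A quick way to check whether duplicate removal could explain a row-count gap is to count duplicated rows with base R. This is a minimal sketch on a hypothetical toy data frame (`toy` and its columns `a` and `b` are made up for illustration, not taken from the question's data):

```r
# Hypothetical toy data: rows (1, "x") and (2, "y") each appear twice
toy <- data.frame(a = c(1, 1, 2, 2, 3),
                  b = c("x", "x", "y", "y", "z"),
                  stringsAsFactors = FALSE)

n_dup    <- sum(duplicated(toy))  # rows that repeat an earlier row
n_unique <- nrow(unique(toy))     # number of distinct rows

n_dup     # 2
n_unique  # 3
```

If the same count on the real df1 is non-zero, a set operation that drops duplicates cannot return every row.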

So the problem is that setdiff returns only the unique rows of df1 that are not in df2; duplicated rows are collapsed into one. To keep them, we may need anti_join:

library(dplyr)
anti_join(df1, df2, by = c("col1", "col2"))

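If we are joining on all columns and the column names are the same in both data frames, simply leave out the by argument and it will automatically pick up all common columns: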

anti_join(df1, df2)
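The difference between the two approaches can be illustrated on toy data. This is only a sketch: the data frames `big` and `small` and their columns `a` and `b` are hypothetical stand-ins for df1 and df2, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical larger data set with duplicated rows
big <- data.frame(a = c(1, 1, 2, 2, 3),
                  b = c("x", "x", "y", "y", "z"),
                  stringsAsFactors = FALSE)

# Hypothetical subset: the single row with a == 3
small <- big[big$a == 3, ]

# setdiff() on data frames removes duplicate rows from the result
by_setdiff <- dplyr::setdiff(big, small)

# anti_join() keeps every row of big that has no match in small
by_anti <- anti_join(big, small, by = c("a", "b"))

nrow(by_setdiff)                          # 2
nrow(by_anti)                             # 4
nrow(by_anti) + nrow(small) == nrow(big)  # TRUE
```

Note that anti_join drops every row of the larger data set that has a match in the smaller one, so the row counts add up only when the subset contains all copies of each matched row.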
