Testing multiple columns for identity in R



Is there a short way to test for identity across multiple columns? For example, given this input

library(data.table)
data=data.table(one=c(1,2,3,4), two=c(7,8,9,10), three=c(1,2,3,4), four=c(1,2,3,4) )

Is there something that reports which columns are identical to data$one? Something like

allcolumnsidentity(data$one, data) # compares all columns with respect to data$one 

should return (TRUE, FALSE, TRUE, TRUE), since data$three and data$four are identical to data$one.

I have seen the identical() and compare() commands, but they only compare two columns with each other. Is there a generalized way?

Best wishes

Here are a few more possible solutions, together with a benchmark on a bigger data set

n <- 1e6
data=data.table(one=rep(1:4, n), 
                two=rep(7:10, n),
                three=rep(1:4, n), 
                four=rep(1:4, n))
library(microbenchmark)
microbenchmark(
              apply(data, 2, identical, data$one) ,
              colSums(data == data$one) == nrow(data),
              colSums(as.matrix(data) == data$one) == nrow(data),
              data[, lapply(.SD, function(x) sum(x == data$one) == .N)],
              data[, lapply(.SD, function(x) identical(x, data$one))]
)

# Unit: milliseconds
#                                                      expr        min          lq        mean      median          uq        max neval
#                       apply(data, 2, identical, data$one)  352.58769  414.846535  457.767582  437.041789  521.895046  643.77981   100
#                   colSums(data == data$one) == nrow(data) 1264.95548 1315.882084 1335.827386 1326.250976 1346.501505 1466.64232   100
#        colSums(as.matrix(data) == data$one) == nrow(data)  110.05474  114.618818  125.116033  121.631323  126.912647  185.69939   100
# data[, lapply(.SD, function(x) sum(x == data$one) == .N)]   75.36791   77.960613   85.599088   79.327108   89.369938  156.03422   100
#   data[, lapply(.SD, function(x) identical(x, data$one))]    7.00261    7.448851    8.687903    8.776724    9.491253   15.72188   100

And here is a comparison in case you have many columns

n <- 1e7
set.seed(123)
data <- data.table(matrix(sample(n, replace = TRUE), ncol = 400))
microbenchmark(
               apply(data, 2, identical, data$V1) ,
               colSums(data == data$V1) == nrow(data),
               colSums(as.matrix(data) == data$V1) == nrow(data),
               data[, lapply(.SD, function(x) sum(x == data$V1) == .N)],
               data[, lapply(.SD, function(x) identical(x,data$V1))]
)
# Unit: milliseconds
#                                                     expr       min        lq      mean    median        uq       max neval
#                       apply(data, 2, identical, data$V1) 176.65997 185.23895 235.44088 234.60227 253.88658 331.18788   100
#                   colSums(data == data$V1) == nrow(data) 680.48398 759.82115 786.64634 774.86919 804.91661 987.26456   100
#        colSums(as.matrix(data) == data$V1) == nrow(data)  60.62470  62.86181  70.41601  63.75478  65.16708 120.30393   100
# data[, lapply(.SD, function(x) sum(x == data$V1) == .N)]  83.95790  86.72680  90.45487  88.46165  90.04441 142.08614   100
#   data[, lapply(.SD, function(x) identical(x, data$V1))]  40.86718  42.65486  45.06100  44.29602  45.49430  91.57465   100
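
Note that the benchmarked expressions are not strictly interchangeable: identical() requires the same type and the same NA positions, while the ==-based variants compare element-wise and propagate NA. A minimal sketch of the difference, assuming a small made-up table (the names dt, res_identical and res_equal are only for illustration):

```r
library(data.table)

# Hypothetical table: 'b' holds the same values as 'a' but stored as doubles,
# and 'c' matches 'a' exactly, including the missing value.
dt <- data.table(a = c(1L, 2L, NA), b = c(1, 2, NA), c = c(1L, 2L, NA))

# identical() is strict: same type, same values, NAs in the same places.
res_identical <- vapply(dt, identical, logical(1), dt$a)

# The == approach compares element-wise; any NA makes the sum NA,
# so the final comparison is NA rather than TRUE or FALSE.
res_equal <- vapply(dt, function(x) sum(x == dt$a) == length(x), logical(1))
```

Here res_identical flags b as different (double versus integer), while res_equal is NA for every column because of the missing value. Which behaviour is wanted depends on the data, so the timings above are best read with that in mind.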

OK, that was easier than expected :)

Simply use apply like this:

apply(data, 2, identical, data$one) 
# returned:
# one   two  three  four 
# TRUE FALSE  TRUE  TRUE 
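
One caveat: apply() first converts the table to a matrix, so with mixed column types everything would be coerced to a common type before comparing. A possible alternative that works on the columns directly is vapply(); the sketch below assumes the example table from the question, and the names same_as_one and matching are made up for illustration:

```r
library(data.table)

data <- data.table(one = c(1, 2, 3, 4), two = c(7, 8, 9, 10),
                   three = c(1, 2, 3, 4), four = c(1, 2, 3, 4))

# Compare each column with data$one; vapply() enforces one logical per column
# and keeps the column names on the result.
same_as_one <- vapply(data, identical, logical(1), data$one)

# Names of the matching columns, leaving out the reference column itself.
matching <- setdiff(names(data)[same_as_one], "one")
```

same_as_one is a named logical vector in the same shape as the apply() result, and matching holds the names of the columns equal to data$one.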
