Finding approximately matching rows of numeric data in R



As a toy example, consider the following, where we have real data x and a perturbed version y, which are stacked and row-shuffled to form z:

x = matrix(1:100, nrow = 100, ncol= 4 , byrow = FALSE)
y = x + matrix( .001 * rnorm(n = 400), nrow = 100, ncol= 4)
z = rbind(x,y)
z = z[sample(nrow(z)),]

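How can we efficiently find or estimate the matching rows in z in R? Ideally, I would like to retrieve only the rows belonging to x, or, for each matched pair, only the row from either x or y, but not both. I have looked at the RecordLinkage package, but for the purely numeric case there should be more efficient solutions. Also, in my setting I have 100K+ rows and 20 columns, and calling compare.dedup on the full dataset requires too much memory.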

Edit: I tried the suggested approach:

library(magrittr)  # provides the %>% pipe used below
set.seed(100)
x = matrix(1:100, nrow = 100, ncol = 4, byrow = FALSE)
y = x + matrix(.001 * rnorm(n = 400), nrow = 100, ncol = 4)
z = rbind(x, y)
#z = z[sample(nrow(z)),]

res = caret::findLinearCombos(t(z))
res$remove %>% sort

The result is shown below. We see that we get 8.0 as well as the perturbed 8.000572, and the same happens for 9 and 10. It works for some pairs, but not in general.

z[res$remove,1] %>% sort
  [1]   2.000000   3.000000   4.000000   4.000952   5.000000   5.001135   6.000000   6.000008   7.000000   7.001225
 [11]   8.000000   8.000572   8.997471   9.000000  10.000000  10.000135  10.999871  11.000000  12.000000  12.000113
 [21]  12.999917  13.000000  13.998705  14.000000  15.000000  15.001787  16.000000  16.002099  17.000000  17.000232
 [31]  18.000000  18.000062  19.000000  19.000354  20.000000  20.000725  21.000000  21.000268  21.999909  22.000000
 [41]  22.999861  23.000000  24.000000  24.001042  25.000000  25.000478  26.000000  26.000567  27.000000  27.000610
 [51]  27.999102  28.000000  29.000000  29.000263  30.000000  30.001195  31.000000  31.000267  32.000000  32.000999
 [61]  33.000000  33.001137  34.000000  34.000603  35.000000  35.001352  36.000000  36.001945  36.998791  37.000000
 [71]  38.000000  38.003187  38.999596  39.000000  39.997090  40.000000  40.999639  41.000000  42.000000  42.000220
 [81]  43.000000  43.000062  44.000000  44.000170  45.000000  45.000222  45.998763  46.000000  47.000000  47.001132
 [91]  47.999887  48.000000  49.000000  49.002185  50.000000  50.000743  51.000000  51.002065  52.000000  52.001307
[101]  52.998977  53.000000  53.999975  54.000000  54.999356  55.000000  56.000000  56.001569  57.000000  57.000013
[111]  58.000000  58.001158  58.999849  59.000000  59.999147  60.000000  61.000000  61.001045  61.999888  62.000000
[121]  62.998223  63.000000  63.999040  64.000000  64.998698  65.000000  66.000000  66.000069  66.999729  67.000000
[131]  68.000000  68.000566  69.000000  69.000426  69.998899  70.000000  71.000000  71.000105  71.999957  72.000000
[141]  73.000000  73.000644  73.999902  74.000000  74.999892  75.000000  76.000000  76.000321  77.000000  77.000765
[151]  78.000000  78.000649  78.999644  79.000000  79.998975  80.000000  80.998300  81.000000  82.000000  82.001297
[161]  82.998977  83.000000  83.998629  84.000000  84.999534  85.000000  85.998803  86.000000  87.000000  87.001064
[171]  87.999871  88.000000  88.998835  89.000000  89.998987  90.000000  91.000000  91.001467  92.000000  92.001252
[181]  93.000000  93.000839  93.998372  94.000000  94.999120  95.000000  95.999964  96.000000  96.999911  97.000000
[191]  98.000000  98.002148  99.000000  99.000914 100.000000 100.001824

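The caret package has a function findLinearCombos() that can help you identify linear dependencies between the columns of a matrix (by leaving out rows and computing the rank each time). In your case, you would want to transpose the matrix. I would give that a try.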
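One caveat with the transposed call: t(z) has only 4 rows, so its rank is at most 4 and nearly every column can be written as a linear combination of the others. That is consistent with the 196 values returned in the edit to the question. For purely numeric data, a simpler base R idea may be worth trying instead: sort the rows on one column and flag a row as a near-duplicate when every coordinate is within a tolerance of the row just before it. The following is only a minimal sketch under assumptions that are not in the original post. The tolerance `tol` is a hypothetical value, chosen well above the noise level and well below the spacing between distinct rows, and the names `zs`, `gap`, `is_dup`, and `z_unique` are illustrative.

```r
set.seed(100)
x <- matrix(1:100, nrow = 100, ncol = 4, byrow = FALSE)
y <- x + matrix(0.001 * rnorm(n = 400), nrow = 100, ncol = 4)
z <- rbind(x, y)
z <- z[sample(nrow(z)), ]

# Assumed tolerance: larger than the perturbation, smaller than the
# distance between genuinely different rows.
tol <- 0.01

# Sort rows by the first column so that near-duplicates become neighbours.
ord <- order(z[, 1])
zs <- z[ord, , drop = FALSE]

# Largest absolute coordinate difference between each row and its predecessor.
gap <- apply(abs(diff(zs)), 1, max)

# A row counts as a near-duplicate if it is within tol of the previous row.
is_dup <- c(FALSE, gap < tol)

# Keep one representative per group, restored to the original row order.
keep <- sort(ord[!is_dup])
z_unique <- z[keep, , drop = FALSE]
nrow(z_unique)
```

This keeps whichever member of each pair sorts first, so the result may mix rows from x and y. It also only compares rows that are adjacent after sorting on a single column, so on dense real data a pair can be missed when unrelated rows fall between its members; a nearest-neighbour search over all columns would be more robust at a higher cost.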
