给定[不一致]约束条件下计算最近邻距离的R函数

我有树木生长测量数据(直径和高度)在已知的X & &;Y坐标。我想确定每棵树的最近邻居与大小相等或更大。

我见过其他SE问题询问最近邻计算(例如，见这里，这里，这里，这里等)，但没有指定要搜索的最近邻的约束。

是否存在一个函数(或其他方法)，允许我确定一个点的最近邻居的距离，假设最近的点满足一些条件(例如，必须等于或大于感兴趣的点的大小)?

[一组更复杂的约束会更有帮助…]

以我的例子为例:指定一棵树必须也与感兴趣的树在同一地块上，或者与感兴趣的树是同一物种

我将使用非相等连接和data.table

编辑:(供参考，这需要数据。表1.9.7(可以从github获取)

EDIT2:使用数据的副本。表，因为它似乎是通过自己的阈值连接的。我将在将来修复这个问题，但是现在可以正常工作。

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1), 
              allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]

最终数据集包含每对，根据给定的阈值

编辑:

附加变量:

如果你想加入额外的参数，这允许你这样做，(如果你额外加入像情节或物种这样的东西，可能会更快，因为笛卡尔连接会更小)

下面是连接两个额外的分类变量，物种和情节的例子:

 library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1,
                            species == species,
                            plot == plot),
              nomatch = NA,
              allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
      id         x         y    height species plot  height.1 i.id       i.x       i.y  i.height        dist
  1:   3 0.4837348 0.4325731  91.53387       C    2 111.53387  486 0.5549221 0.4395687 101.53387 0.005116568
  2:  13 0.8267298 0.3137061  94.58949       C    2 114.58949  754 0.8408547 0.2305702 104.58949 0.007111079
  3:  29 0.2905729 0.4952757  89.52128       C    2 109.52128  333 0.2536760 0.5707272  99.52128 0.007054301
  4:  37 0.4534841 0.5249862  89.95493       C    2 109.95493   72 0.4807242 0.6056771  99.95493 0.007253044
  5:  63 0.1678515 0.8814829  84.77450       C    2 104.77450  289 0.1151764 0.9728488  94.77450 0.011122404
 ---                                                                                                        
994: 137 0.8696393 0.2226888  66.57792       C    2  86.57792  473 0.4467795 0.6881008  76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466       A    2 130.14466  338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821       A    2 133.61821  348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652  71.39953       C    3  91.39953  904 0.6954973 0.3690599  81.39953 0.112536771
998: 172 0.6834473 0.6221259  65.52187       A    2  85.52187  783 0.4400028 0.9526355  75.52187 0.168501816
>

注:在最终答案中，有列height和height。1，后者似乎是数据的结果。

mems -efficient解决方案

@theforestecologist的一个问题是，这需要大量的内存来完成，

(在这种情况下，有额外的42列乘以笛卡尔连接，这导致了内存问题)，

然而，我们可以通过使用。eachi(我相信)以一种更有效的内存方式来做到这一点。因为我们不会将整个表加载到内存中。解决方案如下:

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]

# for clarity inside the brackets, 
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2 
# Join on a range, must be a cartesian join, since there are many candidates 
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2, 
                                 height <= thresh1,
                                 species,
                                 plot),
              .(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
              by = .EACHI,
              nomatch = NA,
              allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
        id          x          y    height species plot i.id        i.x        i.y  i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
   1:    1 0.17622235 0.66547312  84.68450       B    2  965 0.17410840 0.63219350  93.60226         B      2   74.68450   94.68450           B        2
   2:    2 0.04523011 0.33813054  89.46288       B    2  457 0.07267547 0.35725229  88.42827         B      2   79.46288   99.46288           B        2
   3:    3 0.24096368 0.32649256 103.85870       C    3  202 0.20782303 0.38422814  94.35898         C      3   93.85870  113.85870           C        3
   4:    4 0.53160655 0.06636979 101.50614       B    1  248 0.47382417 0.01535036 103.74101         B      1   91.50614  111.50614           B        1
   5:    5 0.83426727 0.55380451 101.93408       C    3  861 0.78210747 0.52812487  96.71422         C      3   91.93408  111.93408           C        3

编辑:

附加变量:

mems -efficient解决方案

相关内容

最新更新

热门标签：