How to loop over multiple instances with dplyr in R to calculate distances



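I have positional data, an example of which is shown below, where time is the time at which each position was recorded, ref is the reference for each point, x is the x coordinate of each point, and y is the y coordinate of each point.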

> print(df)
time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67

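The steps I want to take are as follows (steps 1 to 4 complete successfully):

  1. Replicate the x and y coordinates many times, each with a small error
  2. Calculate the distance between every pair of points at each time
  3. Calculate the sum of these 45 distances for each time instance
  4. Repeat this over all the different iterations created in step 1
  5. Create a new data frame containing all of this information

Step 1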

set.seed(456) #set seed to get consistent results
n <- 3 # this is 3 for this example but would likely be 1000 or 10000 and refers to the number of simulations

for(i in seq(5,(2*n+3),2)){ #create simulations of the xy data set
df[,i] = df[,3] + rnorm(length(df[,2]),0,1) #replicates the x column 
df[,i+1] = df[,4] + rnorm(length(df[,3]),0,1) # replicates the y column
}

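This code works and is easy to adjust, and gives me the following df. The first 4 columns are exactly the same as above. V5 and V6 are the x and y coordinates for n=1, with a small error relative to the original x and y (you can see how similar the values are). V7 and V8 are the x and y for n=2, and V9 and V10 are the x and y for n=3.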

print(df)
time ref     x     y        V5        V6        V7        V8        V9       V10
1     1   1 92.80 49.58 91.456479 49.105396 92.771058 47.325290 91.720518 49.698151
2     1   2 90.20 96.02 90.821776 94.302691 90.593037 95.037940 89.758626 96.889903
3     1   3 91.61 80.05 92.410875 78.623170 91.360386 79.849432 93.630635 79.958064
4     1   4 68.75 20.56 67.361108 20.768236 68.833450 21.455930 68.822856 20.628899
5     1   5  5.53 35.27  4.815643 35.234164  7.608875 35.226455  6.238817 33.587573
6     1   6 39.85 85.39 39.525939 86.524285 39.970852 87.037308 40.700509 86.506956
7     1   7 12.04 87.43 12.730643 86.967145 12.158149 88.993299 10.553803 86.078642
8     1   8 42.98 56.53 43.230548 56.201616 43.750054 55.098622 43.900530 55.992833
9     1   9 19.14 63.56 20.147352 65.044539 17.964598 63.015406 19.288329 63.189886
10    1  10 25.72  7.62 26.293235  6.530622 26.129039  6.848746 25.483132  7.974012
11    2   1 50.39  7.16 49.474189  6.631206 49.725049  6.990012 49.916764  6.350175
12    2   2 17.71  7.15 19.021097  6.556207 17.453475  7.109238 17.040794  6.970275
13    2   3 52.96 34.87 53.948726 32.871084 53.638782 33.149460 54.318527 33.722340
14    2   4 52.70 97.07 54.353929 97.366153 53.596845 98.514106 54.112918 97.166242
15    2   5 70.88 44.88 69.439195 45.050625 71.498356 44.859985 70.147226 45.694700
16    2   6 32.12 71.82 34.067356 73.635652 32.851454 72.090232 32.039448 72.802941
17    2   7 24.15 22.77 25.886936 22.109397 23.736825 22.657066 24.960197 23.620843
18    2   8 18.06 31.03 18.447483 30.889748 19.617813 30.175112 18.562588 32.237347
19    2   9 70.55 92.42 72.830034 91.996021 71.091699 91.386259 71.674023 90.986222
20    2  10 45.05 79.67 46.587883 79.631264 45.627150 79.892027 44.878720 78.569054

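Step 2

I used dplyr to write code that groups the data by time and then calculates the distance between every pair of reference points (this code is shown in step 3). There are 10 reference points, so 45 distances need to be calculated (10 choose 2).

Step 3

For each time group, I want to calculate the sum of all 45 distances. Steps 2 and 3 are in the following code, which has been made into a function.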

sumdist = function(data) {
names(data)[3]<-paste("x") #renames 3rd column x to assist for loop
names(data)[4]<-paste("y") #renames 4th column y to assist for loop
data = data %>% 
group_by(time) %>% 
mutate(dist1 = sqrt((x[which(ref == 1)] - x)^2 + (y[which(ref == 1)] - y)^2)) %>% #distance between all points and point 1
mutate(dist2 = sqrt((x[which(ref == 2)] - x)^2 + (y[which(ref == 2)] - y)^2)) %>% #distance between all points and point 2
mutate(dist3 = sqrt((x[which(ref == 3)] - x)^2 + (y[which(ref == 3)] - y)^2)) %>% #distance between all points and point 3
mutate(dist4 = sqrt((x[which(ref == 4)] - x)^2 + (y[which(ref == 4)] - y)^2)) %>% #distance between all points and point 4
mutate(dist5 = sqrt((x[which(ref == 5)] - x)^2 + (y[which(ref == 5)] - y)^2)) %>% #distance between all points and point 5
mutate(dist6 = sqrt((x[which(ref == 6)] - x)^2 + (y[which(ref == 6)] - y)^2)) %>% #distance between all points and point 6
mutate(dist7 = sqrt((x[which(ref == 7)] - x)^2 + (y[which(ref == 7)] - y)^2)) %>% #distance between all points and point 7
mutate(dist8 = sqrt((x[which(ref == 8)] - x)^2 + (y[which(ref == 8)] - y)^2)) %>% #distance between all points and point 8
mutate(dist9 = sqrt((x[which(ref == 9)] - x)^2 + (y[which(ref == 9)] - y)^2)) %>% #distance between all points and point 9
mutate(dist10 = sqrt((x[which(ref == 10)] - x)^2 + (y[which(ref == 10)] - y)^2)) %>% #distance between all points and point 10
summarise(sumdistances = (sum(dist1,dist2,dist3,dist4,dist5,dist6,dist7,dist8,dist9,dist10))/2) #sum of all distances
print(data$sumdistances)
}

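When I run this function on my df, it calculates using only the first x and y, but it works. It produces a vector of length 2. The first value is for time 1 and the second value is for time 2.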

> sumdist(df) # this calculates it from the original x and y 
[1] 2706.592 2275.045
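
As a cross-check on the numbers above, the sum of all pairwise distances within a time group can also be computed with base R's dist(), which returns each unordered pair exactly once (45 values for 10 points). This is only a sketch and not the code used in the question; sum_pairwise and totals are names made up here, and the data frame is rebuilt from the table printed at the top.

```r
# Rebuild the example data from the printed table (row numbers dropped)
df <- read.table(header = TRUE, text = "
time ref     x     y
1   1 92.80 49.58
1   2 90.20 96.02
1   3 91.61 80.05
1   4 68.75 20.56
1   5  5.53 35.27
1   6 39.85 85.39
1   7 12.04 87.43
1   8 42.98 56.53
1   9 19.14 63.56
1  10 25.72  7.62
2   1 50.39  7.16
2   2 17.71  7.15
2   3 52.96 34.87
2   4 52.70 97.07
2   5 70.88 44.88
2   6 32.12 71.82
2   7 24.15 22.77
2   8 18.06 31.03
2   9 70.55 92.42
2  10 45.05 79.67")

# 10 points give choose(10, 2) = 45 unordered pairs
n_pairs <- choose(10, 2)

# Hypothetical helper: dist() yields each pairwise Euclidean distance once,
# so no division by 2 is needed
sum_pairwise <- function(x, y) sum(dist(cbind(x, y)))

# One total per time value
totals <- sapply(split(df, df$time), function(d) sum_pairwise(d$x, d$y))
totals
```

If the data match the printed table, these totals should agree with the output of sumdist(df) shown above.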

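Step 4

I now want to repeat this over the multiple iterations I created earlier. For my actual data set, n will be in the thousands, so I need to automate this process.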

sumd = matrix(NA, nrow=2, ncol=n+1) # set collection matrix for nrow = number of time and #ncol = number simulations
for(i in 1:(n+1)) {
datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
sumd[i] = sumdist(datas) # runs function on each simulated data set
}

Because my function prints the calculated data at the end, running the code shows that it has calculated what I want.

> for(i in 1:(n+1)) {
+   datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
+   sumd[i] = sumdist(datas) # runs function on each simulated data set
+ }
[1] 2706.592 2275.045
[1] 2695.796 2282.284
[1] 2713.277 2288.517
[1] 2719.587 2273.316

The 4 lines at the bottom are what I want to calculate, although not exactly in this order.

Ideally, it would look more like this:

time       V2       V3       V4       V5
1    1 2706.592 2695.796 2713.277 2719.587
2    2 2275.045 2282.284 2288.517 2273.316

Step 5

However, half of my matrix still contains NA, and it is filled as follows:

> print(sumd)
[,1]     [,2] [,3] [,4]
[1,] 2706.592 2713.277   NA   NA
[2,] 2695.796 2719.587   NA   NA

The warnings I receive are these:

Warning messages:
1: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
2: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
3: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
4: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length

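This seems to say quite plainly what went wrong: the matrix I created does not fit the output. I tried changing the matrix in several ways to make it fit, but I kept getting errors, and in the end I could not get a matrix or data frame containing the information I want.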

Edit - I now understand the error in my initial code that prevented it from working, and naturally it is very simple: sumd[i] should be sumd[,i].
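
To illustrate the fix: sumd[i] addresses a single cell (counted down the columns), so only the first element of each length-2 result is stored and R warns about the rest, whereas sumd[, i] addresses a whole column. The sketch below uses a made-up stand-in, fake_sumdist, in place of the real sumdist so that it runs on its own; the matrix shape is the same as in the question.

```r
n <- 3        # number of simulations, as in the question
n_times <- 2  # number of distinct time values

# Hypothetical stand-in for sumdist(): returns one value per time
fake_sumdist <- function(i) c(100 + i, 200 + i)

# One row per time, one column per data set (original + n simulations)
sumd <- matrix(NA_real_, nrow = n_times, ncol = n + 1)

for (i in seq_len(n + 1)) {
  sumd[, i] <- fake_sumdist(i)  # fill the whole i-th column
}

# Attach the time column to get the layout asked for in the question
result <- data.frame(time = seq_len(n_times), sumd)
result
```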

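OK, after your edit I realized that I had misunderstood your question.

I think the problem with your design is that you want to create the columns in advance. They obviously cannot have proper names, which makes it a bit hard to tell which are x and which are y.

Here is my suggestion: add the Gaussian noise and calculate the sum on the fly.

First, let's recreate the data frame (next time, please share this code or some dput output, which makes it easier to help).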

library(tidyverse)
df = read.table(header=TRUE, text="
time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67")

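Then, let's rewrite the distance calculation, as I find your code a bit redundant. Programming rule of thumb: DRY (don't repeat yourself). If you repeat a structure more than 3 times, you should probably write a function.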

options(dplyr.summarise.inform=FALSE) #don't care about those warnings
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2))
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred])    
distance_sum = function(x, y, ref){
dists = map(1:10, ~distance2(x,y, which(ref == .x)))
sum(unlist(dists))/2
}

Here, I can reproduce your result on x and y:

df %>% 
group_by(time) %>% 
summarise(sum=distance_sum(x, y, ref))
#> # A tibble: 2 x 2
#>    time   sum
#>   <int> <dbl>
#> 1     1 2707.
#> 2     2 2275.

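Finally, we can replicate this a given number of times, adding some random noise beforehand. Again, the resulting values are the same as yours.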

set.seed(456)
n <- 3 #or 10000
xx = rerun(n, {
df %>% 
mutate(x=x+rnorm(length(x),0,1), 
y=y+rnorm(length(y),0,1)) %>% 
group_by(time) %>% 
summarise(sum=distance_sum(x, y, ref)) %>% 
as.data.frame() #needed for the precision in the example, you can drop this line
})
xx
#> [[1]]
#>   time      sum
#> 1    1 2695.796
#> 2    2 2282.284
#> 
#> [[2]]
#>   time      sum
#> 1    1 2713.277
#> 2    2 2288.517
#> 
#> [[3]]
#>   time      sum
#> 1    1 2719.587
#> 2    2 2273.316
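
A note for newer setups: rerun() was deprecated in purrr 1.0.0, so the same pattern may be easier to maintain with map() over seq_len(n). The following is a minimal sketch under that assumption. It uses a small made-up data frame (pts) and a made-up helper (pair_sum, built on dist()) so that it runs on its own, and its random draws are not meant to reproduce the numbers above.

```r
library(dplyr)
library(purrr)

set.seed(1)

# Small hypothetical data set: 3 points at each of 2 times
pts <- data.frame(
  time = rep(1:2, each = 3),
  ref  = rep(1:3, times = 2),
  x    = c(0, 3, 0, 1, 1, 4),
  y    = c(0, 0, 4, 1, 5, 1)
)

# Hypothetical helper: sum of all pairwise Euclidean distances
pair_sum <- function(x, y) sum(dist(cbind(x, y)))

n_sims <- 3

# One summarised data frame per simulation, tagged with its index
sims <- map(seq_len(n_sims), function(i) {
  pts %>%
    mutate(x = x + rnorm(length(x), 0, 1),
           y = y + rnorm(length(y), 0, 1)) %>%
    group_by(time) %>%
    summarise(sum = pair_sum(x, y), .groups = "drop") %>%
    mutate(sim = i)
})
sims
```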

Then you can rbind the list and calculate some statistics on it:

xx %>% #this was run with n=25
reduce(rbind) %>% 
group_by(time) %>% 
summarise(sum_m=mean(sum), sum_sd=sd(sum))
#> # A tibble: 2 x 3
#>    time sum_m sum_sd
#>   <int> <dbl>  <dbl>
#> 1     1 2711.   22.2
#> 2     2 2280.   16.8
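
If the goal is the wide layout from the question (one row per time, one column per simulation), one possible approach is to stack the list with bind_rows() and then reshape with tidyr::pivot_wider(). This is a sketch only: the list below is typed in by hand from the n=3 output printed earlier, and the sim_ column prefix is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# The three per-simulation results, copied from the n = 3 output above
xx <- list(
  data.frame(time = 1:2, sum = c(2695.796, 2282.284)),
  data.frame(time = 1:2, sum = c(2713.277, 2288.517)),
  data.frame(time = 1:2, sum = c(2719.587, 2273.316))
)

wide <- xx %>%
  bind_rows(.id = "sim") %>%            # "sim" holds the list position
  pivot_wider(names_from = sim,
              values_from = sum,
              names_prefix = "sim_")    # columns sim_1, sim_2, sim_3
wide
```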

Created on 2020-06-18 by the reprex package (v0.3.0)
df <- tibble(
ref = rep(c(1, 2, 3), each = 5),
x = rnorm(15, 10, 8),
y = rnorm(15, 35, 20)
)
# Number of created points
n <- 3
# Putting x and y as point
df <- df %>%
mutate(point = map2(x, y, c)) 
# Adding noise to point
new_points <- seq_len(n)
names(new_points) <- new_points %>% str_c("point_", .)
new_cols <- new_points %>%
map(~list(rnorm(15), rnorm(15)) %>% transpose() %>% map(unlist)) %>%
map(~map2(.x, df$point, ~.x+.y)) %>%
as_tibble()
# Binding new points 
df <- df %>%
bind_cols(new_cols)
# Functions for calculating Euclidean distance of point list
dList <- function(a, b)
b %>% 
map_dbl(~(a - .x)^2 %>% sum() %>% sqrt())
sumDistanceList <- function(l)
seq_len(length(l) - 1) %>%
map(~dList(l[[.x]], l[(.x+1):length(l)])) %>%
unlist() %>%
sum()
# Summarise
df %>%
group_by(ref) %>%
summarise(across(str_subset(names(.), "point_"), sumDistanceList)) 
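
To see what the two helpers above compute, here is a small sanity check on a 3-4-5 right triangle, whose three side lengths sum to 12. The helpers are restated with explicit function arguments so that the block runs on its own; triangle and total are names made up for this sketch.

```r
library(purrr)

# Distance from point a to every point in list b
dList <- function(a, b) {
  map_dbl(b, function(p) sqrt(sum((a - p)^2)))
}

# Sum of distances over every unordered pair of points in list l
sumDistanceList <- function(l) {
  idx <- seq_len(length(l) - 1)
  per_point <- map(idx, function(i) dList(l[[i]], l[(i + 1):length(l)]))
  sum(unlist(per_point))
}

# Points of a 3-4-5 right triangle: sides 3, 4 and 5
triangle <- list(c(0, 0), c(3, 0), c(0, 4))
total <- sumDistanceList(triangle)
total  # 3 + 4 + 5 = 12
```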
