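I have positional data, an example of which is shown below, where time is the time at which each position was recorded, ref is the reference for each point, x is the x coordinate of each point, and y is the y coordinate of each point.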
> print(df)
time ref x y
1 1 1 92.80 49.58
2 1 2 90.20 96.02
3 1 3 91.61 80.05
4 1 4 68.75 20.56
5 1 5 5.53 35.27
6 1 6 39.85 85.39
7 1 7 12.04 87.43
8 1 8 42.98 56.53
9 1 9 19.14 63.56
10 1 10 25.72 7.62
11 2 1 50.39 7.16
12 2 2 17.71 7.15
13 2 3 52.96 34.87
14 2 4 52.70 97.07
15 2 5 70.88 44.88
16 2 6 32.12 71.82
17 2 7 24.15 22.77
18 2 8 18.06 31.03
19 2 9 70.55 92.42
20 2 10 45.05 79.67
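The steps I want to take are as follows (steps 1 to 4 have been completed successfully):
- Replicate the x and y coordinates multiple times, each time with a small error
- Calculate the distance between every pair of points at each time instance
- Calculate the sum of these 45 distances for each time instance
- Repeat this across all the different iterations created in step 1
- Create a new data frame containing all of this information
Step 1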
set.seed(456) #set seed to get consistent results
n <- 3 # this is 3 for this example but would likely be 1000 or 10000 and refers to the number of simulations
for(i in seq(5,(2*n+3),2)){ #create simulations of the xy data set
df[,i] = df[,3] + rnorm(length(df[,2]),0,1) #replicates the x column
df[,i+1] = df[,4] + rnorm(length(df[,3]),0,1) # replicates the y column
}
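This code works and is easy to adjust, and it gives me the following df. The first 4 columns are exactly the same as above. V5 and V6 are the x and y coordinates for simulation 1, with a small error relative to the original x and y (you can see how similar the values are). V7 and V8 are x and y for simulation 2, and V9 and V10 are x and y for simulation 3.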
print(df)
time ref x y V5 V6 V7 V8 V9 V10
1 1 1 92.80 49.58 91.456479 49.105396 92.771058 47.325290 91.720518 49.698151
2 1 2 90.20 96.02 90.821776 94.302691 90.593037 95.037940 89.758626 96.889903
3 1 3 91.61 80.05 92.410875 78.623170 91.360386 79.849432 93.630635 79.958064
4 1 4 68.75 20.56 67.361108 20.768236 68.833450 21.455930 68.822856 20.628899
5 1 5 5.53 35.27 4.815643 35.234164 7.608875 35.226455 6.238817 33.587573
6 1 6 39.85 85.39 39.525939 86.524285 39.970852 87.037308 40.700509 86.506956
7 1 7 12.04 87.43 12.730643 86.967145 12.158149 88.993299 10.553803 86.078642
8 1 8 42.98 56.53 43.230548 56.201616 43.750054 55.098622 43.900530 55.992833
9 1 9 19.14 63.56 20.147352 65.044539 17.964598 63.015406 19.288329 63.189886
10 1 10 25.72 7.62 26.293235 6.530622 26.129039 6.848746 25.483132 7.974012
11 2 1 50.39 7.16 49.474189 6.631206 49.725049 6.990012 49.916764 6.350175
12 2 2 17.71 7.15 19.021097 6.556207 17.453475 7.109238 17.040794 6.970275
13 2 3 52.96 34.87 53.948726 32.871084 53.638782 33.149460 54.318527 33.722340
14 2 4 52.70 97.07 54.353929 97.366153 53.596845 98.514106 54.112918 97.166242
15 2 5 70.88 44.88 69.439195 45.050625 71.498356 44.859985 70.147226 45.694700
16 2 6 32.12 71.82 34.067356 73.635652 32.851454 72.090232 32.039448 72.802941
17 2 7 24.15 22.77 25.886936 22.109397 23.736825 22.657066 24.960197 23.620843
18 2 8 18.06 31.03 18.447483 30.889748 19.617813 30.175112 18.562588 32.237347
19 2 9 70.55 92.42 72.830034 91.996021 71.091699 91.386259 71.674023 90.986222
20 2 10 45.05 79.67 46.587883 79.631264 45.627150 79.892027 44.878720 78.569054
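Step 2
I wrote code using dplyr that groups the data by time and then calculates the distance between every pair of reference points (the code is shown in Step 3). There are 10 reference points, so 45 distances need to be calculated (10 choose 2).
Step 3
For each time group, I want to calculate the sum of all 45 distances. Steps 2 and 3 are in the following code, which has been wrapped in a function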
sumdist = function(data) {
names(data)[3]<-paste("x") #renames 3rd column x to assist for loop
names(data)[4]<-paste("y") #renames 4th column y to assist for loop
data = data %>%
group_by(time) %>%
mutate(dist1 = sqrt((x[which(ref == 1)] - x)^2 + (y[which(ref == 1)] - y)^2)) %>% #distance between all points and point 1
mutate(dist2 = sqrt((x[which(ref == 2)] - x)^2 + (y[which(ref == 2)] - y)^2)) %>% #distance between all points and point 2
mutate(dist3 = sqrt((x[which(ref == 3)] - x)^2 + (y[which(ref == 3)] - y)^2)) %>% #distance between all points and point 3
mutate(dist4 = sqrt((x[which(ref == 4)] - x)^2 + (y[which(ref == 4)] - y)^2)) %>% #distance between all points and point 4
mutate(dist5 = sqrt((x[which(ref == 5)] - x)^2 + (y[which(ref == 5)] - y)^2)) %>% #distance between all points and point 5
mutate(dist6 = sqrt((x[which(ref == 6)] - x)^2 + (y[which(ref == 6)] - y)^2)) %>% #distance between all points and point 6
mutate(dist7 = sqrt((x[which(ref == 7)] - x)^2 + (y[which(ref == 7)] - y)^2)) %>% #distance between all points and point 7
mutate(dist8 = sqrt((x[which(ref == 8)] - x)^2 + (y[which(ref == 8)] - y)^2)) %>% #distance between all points and point 8
mutate(dist9 = sqrt((x[which(ref == 9)] - x)^2 + (y[which(ref == 9)] - y)^2)) %>% #distance between all points and point 9
mutate(dist10 = sqrt((x[which(ref == 10)] - x)^2 + (y[which(ref == 10)] - y)^2)) %>% #distance between all points and point 10
summarise(sumdistances = (sum(dist1,dist2,dist3,dist4,dist5,dist6,dist7,dist8,dist9,dist10))/2) #sum of all distances
print(data$sumdistances)
}
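When I run this function on my df, it only uses the first x and y (columns 3 and 4) for the calculation, but it works. It produces a vector of length 2: the first value is for time 1 and the second is for time 2.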
> sumdist(df) # this calculates it from the original x and y
[1] 2706.592 2275.045
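Step 4
I now want to repeat this across the multiple iterations I created earlier. For my actual data set, n will be in the thousands, so I need to automate this process.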
sumd = matrix(NA, nrow=2, ncol=n+1) # set collection matrix for nrow = number of time and #ncol = number simulations
for(i in 1:(n+1)) {
datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
sumd[i] = sumdist(datas) # runs function on each simulated data set
}
Because my function prints the calculated data at the end, running the code shows that it has calculated what I want
> for(i in 1:(n+1)) {
+ datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
+ sumd[i] = sumdist(datas) # runs function on each simulated data set
+ }
[1] 2706.592 2275.045
[1] 2695.796 2282.284
[1] 2713.277 2288.517
[1] 2719.587 2273.316
The 4 lines at the bottom are what I want to calculate, although not quite in this layout.
Ideally it would look more like this
time V2 V3 V4 V5
1 1 2706.592 2695.796 2713.277 2719.587
2 2 2275.045 2282.284 2288.517 2273.316
Step 5
However, half of my matrix still contains NA, and it is filled as follows:
> print(sumd)
[,1] [,2] [,3] [,4]
[1,] 2706.592 2713.277 NA NA
[2,] 2695.796 2719.587 NA NA
The warnings I receive are these
Warning messages:
1: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
2: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
3: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
4: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
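This seems to say quite plainly what is going wrong: the matrix I created does not fit the output. I have tried changing the matrix in several ways to make it fit, but I keep getting errors, and in the end I cannot seem to get a matrix or data frame containing the information I want.
EDIT - I now understand the error in my initial code that stopped it from working, and it was naturally very simple. sumd[i] should be changed to sumd[,i]. With sumd[i], R addresses a single cell (counting down the columns), so only the first value of each length-2 result was stored; sumd[,i] fills the whole i-th column.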
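As a minimal sketch of the column-wise fill described in the edit, the snippet below uses a made-up stand-in (toy_sumdist, hypothetical) that returns one value per time point, just as sumdist() does. The names toy_sumdist and n_sims are assumptions for illustration and are not part of the original code.

```r
# Hypothetical stand-in for sumdist(): returns one value per time point (length 2).
toy_sumdist <- function(i) c(100 + i, 200 + i)

n_sims <- 3
# One row per time point, one column per data set (original + simulations).
sumd <- matrix(NA_real_, nrow = 2, ncol = n_sims + 1)

for (i in seq_len(n_sims + 1)) {
  sumd[, i] <- toy_sumdist(i)  # assign the length-2 result to the whole i-th column
}

sumd
```

Because each result has the same length as a column, no recycling warning should be raised and no NA cells should remain.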
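OK, after your edit I realised that I had misunderstood your question.
I think the problem with your design is that you want to create the columns in advance. Obviously they cannot have proper names, which makes it a bit hard to tell which columns are x and which are y.
Here is my suggestion: add the Gaussian noise and compute the sums on the fly.
First, let's recreate the data frame (next time you could share this code or some dput output, which makes it easier to help).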
library(tidyverse)
df = read.table(header=TRUE, text="
time ref x y
1 1 1 92.80 49.58
2 1 2 90.20 96.02
3 1 3 91.61 80.05
4 1 4 68.75 20.56
5 1 5 5.53 35.27
6 1 6 39.85 85.39
7 1 7 12.04 87.43
8 1 8 42.98 56.53
9 1 9 19.14 63.56
10 1 10 25.72 7.62
11 2 1 50.39 7.16
12 2 2 17.71 7.15
13 2 3 52.96 34.87
14 2 4 52.70 97.07
15 2 5 70.88 44.88
16 2 6 32.12 71.82
17 2 7 24.15 22.77
18 2 8 18.06 31.03
19 2 9 70.55 92.42
20 2 10 45.05 79.67")
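Then, let's rewrite the distance calculation, as I find your code a bit redundant. Programming rule of thumb: DRY (don't repeat yourself). If you repeat a structure more than 3 times, you should probably write a function.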
options(dplyr.summarise.inform=FALSE) #don't care about those warnings
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2))
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred])
distance_sum = function(x, y, ref){
dists = map(1:10, ~distance2(x,y, which(ref == .x)))
sum(unlist(dists))/2
}
Here, I can reproduce your result on x and y:
df %>%
group_by(time) %>%
summarise(sum=distance_sum(x, y, ref))
#> # A tibble: 2 x 2
#> time sum
#> <int> <dbl>
#> 1 1 2707.
#> 2 2 2275.
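As a cross-check on the pairwise-distance logic, here is a small sketch using base R's dist(), which returns each unordered pair of rows exactly once, so its sum is the same quantity as the "sum of all distances divided by 2" above. This is an alternative illustration, not the method used in this answer. The coordinates are the time-1 rows of the example data, and the variable names are assumptions.

```r
# Time-1 coordinates copied from the example data above.
x <- c(92.80, 90.20, 91.61, 68.75, 5.53, 39.85, 12.04, 42.98, 19.14, 25.72)
y <- c(49.58, 96.02, 80.05, 20.56, 35.27, 85.39, 87.43, 56.53, 63.56, 7.62)

# dist() returns the lower triangle of the Euclidean distance matrix,
# i.e. every unordered pair of points exactly once.
d <- dist(cbind(x, y))
length(d)        # number of pairs, choose(10, 2)
total <- sum(d)  # sum of all pairwise distances for this time point

# Cross-check against an explicit double loop over pairs i < j.
manual <- 0
for (i in 1:9) for (j in (i + 1):10) {
  manual <- manual + sqrt((x[i] - x[j])^2 + (y[i] - y[j])^2)
}
all.equal(total, manual)
```

Since dist() counts each pair once, no division by 2 is needed, and the number of points does not have to be hard-coded.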
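Finally, we can replicate this a given number of times, adding some random noise beforehand. Again, the resulting values are the same as yours.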
set.seed(456)
n <- 3 #or 10000
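#note: rerun() is deprecated since purrr 1.0.0; map(seq_len(n), ~ {...}) is the suggested replacement
xx = rerun(n, {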
df %>%
mutate(x=x+rnorm(length(x),0,1),
y=y+rnorm(length(y),0,1)) %>%
group_by(time) %>%
summarise(sum=distance_sum(x, y, ref)) %>%
as.data.frame() #needed for the precision in the example, you can drop this line
})
xx
#> [[1]]
#> time sum
#> 1 1 2695.796
#> 2 2 2282.284
#>
#> [[2]]
#> time sum
#> 1 1 2713.277
#> 2 2 2288.517
#>
#> [[3]]
#> time sum
#> 1 1 2719.587
#> 2 2 2273.316
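If the wide layout from the question (one row per time, one column per simulation) is preferred, the same idea can be sketched in base R with replicate(), which binds equal-length results into a matrix. This is only a sketch under assumptions: sum_by_time is a hypothetical helper, toy is a small made-up data set, and dist() is swapped in for the distance functions defined above.

```r
# Hypothetical helper: sum of all pairwise distances within each time point.
sum_by_time <- function(time, x, y) {
  idx <- split(seq_along(time), time)  # row indices grouped by time
  vapply(idx, function(i) sum(dist(cbind(x[i], y[i]))), numeric(1))
}

# Small made-up data set: 2 time points with 4 points each.
toy <- data.frame(
  time = rep(1:2, each = 4),
  x = c(0, 3, 0, 3, 1, 2, 4, 8),
  y = c(0, 0, 4, 4, 1, 5, 2, 3)
)

set.seed(456)
n_sims <- 3  # would be 1000 or 10000 in practice

# Each replicate adds fresh Gaussian noise and returns one sum per time point;
# replicate() collects the results as columns of a matrix.
sims <- replicate(n_sims, {
  sum_by_time(toy$time,
              toy$x + rnorm(nrow(toy), 0, 1),
              toy$y + rnorm(nrow(toy), 0, 1))
})

# Noise-free result in the first column, simulations after it.
result <- cbind(original = sum_by_time(toy$time, toy$x, toy$y), sims)
result
```

Nothing is stored except the per-time sums, so the noisy coordinate columns never need names.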
You can then rbind the list and compute some statistics on it:
xx %>% #this was run with n=25
reduce(rbind) %>%
group_by(time) %>%
summarise(sum_m=mean(sum), sum_sd=sd(sum))
#> # A tibble: 2 x 3
#> time sum_m sum_sd
#> <int> <dbl> <dbl>
#> 1 1 2711. 22.2
#> 2 2 2280. 16.8
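When the simulated sums are held in a matrix with one row per time and one column per simulation, the same mean and standard deviation can be sketched in base R. The matrix below contains made-up numbers purely for illustration, and the names sims and summary_df are assumptions.

```r
# Made-up simulation results: rows are time points, columns are simulations.
sims <- matrix(c(10, 20, 12, 22, 14, 24), nrow = 2,
               dimnames = list(time = c("1", "2"), NULL))

# Per-time mean and standard deviation across the simulations.
summary_df <- data.frame(
  time   = rownames(sims),
  sum_m  = rowMeans(sims),
  sum_sd = apply(sims, 1, sd)
)
summary_df
```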
Created on 2020-06-18 by the reprex package (v0.3.0)
library(tidyverse) # provides tibble, dplyr, purrr and stringr, all used below
df <- tibble(
ref = rep(c(1, 2, 3), each = 5),
x = rnorm(15, 10, 8),
y = rnorm(15, 35, 20)
)
# Number of created points
n <- 3
# Putting x and y as point
df <- df %>%
mutate(point = map2(x, y, c))
# Adding noise to point
new_points <- seq_len(n)
names(new_points) <- new_points %>% str_c("point_", .)
new_cols <- new_points %>%
map(~list(rnorm(15), rnorm(15)) %>% transpose() %>% map(unlist)) %>%
map(~map2(.x, df$point, ~.x+.y)) %>%
as_tibble()
# Binding new points
df <- df %>%
bind_cols(new_cols)
# Functions for calculating the Euclidean distance between the points in a list
dList <- function(a, b)
b %>%
map_dbl(~(a - .x)^2 %>% sum() %>% sqrt())
sumDistanceList <- function(l)
seq_len(length(l) - 1) %>%
map(~dList(l[[.x]], l[(.x+1):length(l)])) %>%
unlist() %>%
sum()
# Summarise
df %>%
group_by(ref) %>%
summarise(across(str_subset(names(.), "point_"), sumDistanceList))