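I'm new to R and I'm trying to filter my dataset to avoid autocorrelation. My dataset consists of more than 50,000 locations (longitude and latitude) from 25 GPS-collared animals, with a date-time stamp (acquisition_time) and some additional information (age, sex, study area). For each individual ($animals_id), I need to filter the locations so that only those whose acquisition times are at least 6 hours apart are kept. I would start by grouping the data by individual and acquisition time, but I don't know how to write the filter function.
Here is a subset of my dataset: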
animals_id acquisition_time longitude latitude projection
8663 74 2018-02-17 03:00:24 6.426237 50.31815 EPSG:4326-WGS48
8664 74 2018-02-17 13:00:48 6.428196 50.31657 EPSG:4326-WGS48
8665 74 2018-02-17 18:00:24 6.423940 50.31833 EPSG:4326-WGS48
8666 74 2018-02-18 14:00:24 6.420372 50.31563 EPSG:4326-WGS48
8667 74 2018-02-18 19:00:54 6.420273 50.31534 EPSG:4326-WGS48
8668 74 2018-02-19 00:00:24 6.415756 50.31993 EPSG:4326-WGS48
8669 74 2018-02-19 20:00:24 6.415771 50.31927 EPSG:4326-WGS48
8670 78 2017-05-01 01:00:08 6.337308 50.26133 EPSG:4326-WGS48
8671 78 2017-05-01 06:00:23 6.345836 50.25292 EPSG:4326-WGS48
8672 78 2017-05-01 11:00:41 6.345818 50.25295 EPSG:4326-WGS48
8673 78 2017-05-01 16:00:23 6.345813 50.25287 EPSG:4326-WGS48
8674 78 2017-05-01 21:00:12 6.343215 50.25456 EPSG:4326-WGS48
8675 78 2017-05-02 02:00:23 6.342139 50.25576 EPSG:4326-WGS48
8676 78 2017-05-02 07:00:47 6.352676 50.25308 EPSG:4326-WGS48
collar_type study_area_id animals_age_class animals_sex
8663 gps 15 a f
8664 gps 15 a f
8665 gps 15 a f
8666 gps 15 a f
8667 gps 15 a f
8668 gps 15 a f
8669 gps 15 a f
8670 gps 15 a f
8671 gps 15 a f
8672 gps 15 a f
8673 gps 15 a f
8674 gps 15 a f
8675 gps 15 a f
8676 gps 15 a f
My code:
data$acquisition_time = as.POSIXct(data$acquisition_time, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
filtered <- data %>% group_by(animals_id,acquisition_time) %>% filter()
Thanks for your suggestions.
A quick look at the data and the intervals between acquisition times:
animals_id acquisition_time longitude latitude hours
8663 74 2018-02-17 03:00:24 6.426237 50.31815 0.000000
8664 74 2018-02-17 13:00:48 6.428196 50.31657 10.006667
8665 74 2018-02-17 18:00:24 6.423940 50.31833 4.993333
8666 74 2018-02-18 14:00:24 6.420372 50.31563 20.000000
8667 74 2018-02-18 19:00:54 6.420273 50.31534 5.008333
8668 74 2018-02-19 00:00:24 6.415756 50.31993 4.991667
8669 74 2018-02-19 20:00:24 6.415771 50.31927 20.000000
8670 78 2017-05-01 01:00:08 6.337308 50.26133 0.000000
8671 78 2017-05-01 06:00:23 6.345836 50.25292 5.004167
8672 78 2017-05-01 11:00:41 6.345818 50.25295 5.005000
8673 78 2017-05-01 16:00:23 6.345813 50.25287 4.995000
8674 78 2017-05-01 21:00:12 6.343215 50.25456 4.996944
8675 78 2017-05-02 02:00:23 6.342139 50.25576 5.003056
8676 78 2017-05-02 07:00:47 6.352676 50.25308 5.006667
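To me, this means that for id 74 we would remove rows 8665 and 8667, and for id 78 we would remove rows 8671, 8673, and 8675. Each interval is measured from the most recently kept row, not from the row immediately before it. Doing this per animals_id leaves all observations at least 6 hours apart.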
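The answer does not show how the hours column in the table above was computed. One way to get such a column, sketched here on a shortened copy of the sample data (the first four fixes of animal 74 and the first three of animal 78), is a per-animal diff of the timestamps. The use of ave for this step is an assumption, not necessarily what was used for the table.

```r
# Shortened copy of the sample data, sorted by time within each animal
dat <- data.frame(
  animals_id = c(74L, 74L, 74L, 74L, 78L, 78L, 78L),
  acquisition_time = as.POSIXct(
    c("2018-02-17 03:00:24", "2018-02-17 13:00:48",
      "2018-02-17 18:00:24", "2018-02-18 14:00:24",
      "2017-05-01 01:00:08", "2017-05-01 06:00:23",
      "2017-05-01 11:00:41"),
    tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
)

# Seconds since the epoch, so that ave() works on plain numbers
secs <- as.numeric(dat$acquisition_time)

# Hours since the previous fix of the same animal (0 for each animal's first fix)
dat$hours <- ave(secs, dat$animals_id,
                 FUN = function(z) c(0, diff(z)) / 3600)
dat
```

Note that these are gaps between consecutive rows; once a row is dropped, the gap that matters is the one back to the last row that was kept.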
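base R
func <- function(z, period = 6*3600) {
  # Work in plain seconds: subtracting POSIXct values yields a difftime whose
  # units are picked automatically (hours for this data), which would break
  # the comparison against `period` (given in seconds).
  z <- as.numeric(z)
  if (length(z) < 2) return(rep(TRUE, length(z)))
  out <- TRUE  # always keep the first observation
  ind <- 1     # index of the most recently kept observation
  while (ind < length(z)) {
    # later observations that are at least `period` seconds after z[ind]
    found <- which( (z[-seq_len(ind)] - z[ind]) >= period )
    if (!length(found)) {
      # nothing left that is far enough away: drop the rest
      out <- c(out, rep(FALSE, length(z) - length(out)))
      break
    }
    # drop everything before the first hit, keep the hit itself
    out <- c(out, rep(FALSE, found[1] - 1), TRUE)
    ind <- ind + found[1]
  }
  out
}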
dat[ave(as.numeric(dat$acquisition_time, units = "sec"), dat$animals_id, FUN = func) > 0,]
# animals_id acquisition_time longitude latitude
# 8663 74 2018-02-17 03:00:24 6.426237 50.31815
# 8664 74 2018-02-17 13:00:48 6.428196 50.31657
# 8666 74 2018-02-18 14:00:24 6.420372 50.31563
# 8668 74 2018-02-19 00:00:24 6.415756 50.31993
# 8669 74 2018-02-19 20:00:24 6.415771 50.31927
# 8670 78 2017-05-01 01:00:08 6.337308 50.26133
# 8672 78 2017-05-01 11:00:41 6.345818 50.25295
# 8674 78 2017-05-01 21:00:12 6.343215 50.25456
# 8676 78 2017-05-02 07:00:47 6.352676 50.25308
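(Note: base R's ave has one major limitation: the values returned by the supplied FUN are written back into a copy of the input vector, so they effectively have to be of the same class as the input. That causes problems when the input is POSIXt. To mitigate this, I preemptively convert the times to numeric in the call to ave. Not all of base R's grouping functions need this, only ave, though it is the best fit for this purpose.)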
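A minimal sketch of that coercion, using made-up vectors x and g rather than the tracking data: FUN returns a logical vector, but ave hands back doubles, which is why the base R call above compares the result with 0 before using it as a row index.

```r
x <- c(1, 2, 3, 4)
g <- c("a", "a", "b", "b")

# FUN returns a logical vector, but ave() writes the results back into a copy
# of the numeric input, so they come out as 0/1 doubles
res <- ave(x, g, FUN = function(z) z > 1)
class(res)   # "numeric"
res > 0      # comparing with 0 recovers a logical index
```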
dplyr
library(dplyr)
dat %>%
  group_by(animals_id) %>%
  filter(func(acquisition_time)) %>%
  # not necessary, just here to show the resulting hours-between-times
  mutate(hours = c(0, diff(acquisition_time, units = "hours"))) %>%
  ungroup()
# # A tibble: 9 x 5
# animals_id acquisition_time longitude latitude hours
# <int> <dttm> <dbl> <dbl> <dbl>
# 1 74 2018-02-17 03:00:24 6.43 50.3 0
# 2 74 2018-02-17 13:00:48 6.43 50.3 10.0
# 3 74 2018-02-18 14:00:24 6.42 50.3 25.0
# 4 74 2018-02-19 00:00:24 6.42 50.3 10
# 5 74 2018-02-19 20:00:24 6.42 50.3 20
# 6 78 2017-05-01 01:00:08 6.34 50.3 0
# 7 78 2017-05-01 11:00:41 6.35 50.3 10.0
# 8 78 2017-05-01 21:00:12 6.34 50.3 9.99
# 9 78 2017-05-02 07:00:47 6.35 50.3 10.0
(Note that dplyr drops row names. I added the hours column only to demonstrate the resulting time differences; it is not needed in production.)
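As a cross-check on the rule being applied (keep the first fix, then every fix at least 6 hours after the most recently kept one), the same selection can be written as a plain loop. keep_spaced is a hypothetical helper introduced only for this sketch; it assumes the times are sorted in ascending order within an animal and contain no NA values.

```r
# Hypothetical helper (not part of the answer above): keep the first fix, then
# every fix that is at least `period` seconds after the most recently kept one.
keep_spaced <- function(times, period = 6 * 3600) {
  secs <- as.numeric(times)          # POSIXct -> seconds since the epoch
  keep <- logical(length(secs))
  last_kept <- -Inf                  # guarantees that the first fix is kept
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept >= period) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Animal 74 from the sample data
times <- as.POSIXct(
  c("2018-02-17 03:00:24", "2018-02-17 13:00:48", "2018-02-17 18:00:24",
    "2018-02-18 14:00:24", "2018-02-18 19:00:54", "2018-02-19 00:00:24",
    "2018-02-19 20:00:24"),
  tz = "UTC", format = "%Y-%m-%d %H:%M:%S")

keep <- keep_spaced(times)
keep
# smallest gap, in hours, between consecutive kept fixes
min_gap_hours <- min(diff(as.numeric(times[keep]))) / 3600
min_gap_hours
```

For animal 74 this keeps the same five rows as the output shown earlier (8663, 8664, 8666, 8668, 8669), and the smallest remaining gap is 10 hours.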
Data: to keep this a minimal working example (MWE), I used only the first four columns of the data above.
dat <- structure(list(animals_id = c(74L, 74L, 74L, 74L, 74L, 74L, 74L, 78L, 78L, 78L, 78L, 78L, 78L, 78L), acquisition_time = structure(c(1518836424, 1518872448, 1518890424, 1518962424, 1518980454, 1518998424, 1519070424, 1493600408, 1493618423, 1493636441, 1493654423, 1493672412, 1493690423, 1493708447), class = c("POSIXct", "POSIXt"), tzone = "UTC"), longitude = c(6.426237, 6.428196, 6.42394, 6.420372, 6.420273, 6.415756, 6.415771, 6.337308, 6.345836, 6.345818, 6.345813, 6.343215, 6.342139, 6.352676), latitude = c(50.31815, 50.31657, 50.31833, 50.31563, 50.31534, 50.31993, 50.31927, 50.26133, 50.25292, 50.25295, 50.25287, 50.25456, 50.25576, 50.25308 )), row.names = c("8663", "8664", "8665", "8666", "8667", "8668", "8669", "8670", "8671", "8672", "8673", "8674", "8675", "8676"), class = "data.frame")