Filter datetimes to at most one every 6 hours in R

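I am new to R and I am trying to filter my dataset to avoid autocorrelation. My dataset consists of more than 50,000 locations (longitude and latitude) from 25 GPS-collared animals, with a datetime stamp (acquisition_time) and some additional information (age, sex, study area). For each individual ($animals_id) I need to filter the set of locations so that it only includes locations whose acquisition times are at least 6 hours apart. I would start by grouping the data by individual and acquisition time, but I don't know how to write the filter function.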

Here is a subset of my dataset:

animals_id    acquisition_time longitude latitude      projection
8663         74 2018-02-17 03:00:24  6.426237 50.31815 EPSG:4326-WGS48
8664         74 2018-02-17 13:00:48  6.428196 50.31657 EPSG:4326-WGS48
8665         74 2018-02-17 18:00:24  6.423940 50.31833 EPSG:4326-WGS48
8666         74 2018-02-18 14:00:24  6.420372 50.31563 EPSG:4326-WGS48
8667         74 2018-02-18 19:00:54  6.420273 50.31534 EPSG:4326-WGS48
8668         74 2018-02-19 00:00:24  6.415756 50.31993 EPSG:4326-WGS48
8669         74 2018-02-19 20:00:24  6.415771 50.31927 EPSG:4326-WGS48
8670         78 2017-05-01 01:00:08  6.337308 50.26133 EPSG:4326-WGS48
8671         78 2017-05-01 06:00:23  6.345836 50.25292 EPSG:4326-WGS48
8672         78 2017-05-01 11:00:41  6.345818 50.25295 EPSG:4326-WGS48
8673         78 2017-05-01 16:00:23  6.345813 50.25287 EPSG:4326-WGS48
8674         78 2017-05-01 21:00:12  6.343215 50.25456 EPSG:4326-WGS48
8675         78 2017-05-02 02:00:23  6.342139 50.25576 EPSG:4326-WGS48
8676         78 2017-05-02 07:00:47  6.352676 50.25308 EPSG:4326-WGS48
collar_type study_area_id animals_age_class animals_sex
8663         gps            15                 a           f
8664         gps            15                 a           f
8665         gps            15                 a           f
8666         gps            15                 a           f
8667         gps            15                 a           f
8668         gps            15                 a           f
8669         gps            15                 a           f
8670         gps            15                 a           f
8671         gps            15                 a           f
8672         gps            15                 a           f
8673         gps            15                 a           f
8674         gps            15                 a           f
8675         gps            15                 a           f
8676         gps            15                 a           f

My code:

data$acquisition_time = as.POSIXct(data$acquisition_time, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
filtered <- data %>% group_by(animals_id,acquisition_time) %>% filter()

Thanks for any suggestions.

A quick look at the data and the intervals between acquisition times:

animals_id    acquisition_time longitude latitude     hours
8663         74 2018-02-17 03:00:24  6.426237 50.31815  0.000000
8664         74 2018-02-17 13:00:48  6.428196 50.31657 10.006667
8665         74 2018-02-17 18:00:24  6.423940 50.31833  4.993333
8666         74 2018-02-18 14:00:24  6.420372 50.31563 20.000000
8667         74 2018-02-18 19:00:54  6.420273 50.31534  5.008333
8668         74 2018-02-19 00:00:24  6.415756 50.31993  4.991667
8669         74 2018-02-19 20:00:24  6.415771 50.31927 20.000000
8670         78 2017-05-01 01:00:08  6.337308 50.26133  0.000000
8671         78 2017-05-01 06:00:23  6.345836 50.25292  5.004167
8672         78 2017-05-01 11:00:41  6.345818 50.25295  5.005000
8673         78 2017-05-01 16:00:23  6.345813 50.25287  4.995000
8674         78 2017-05-01 21:00:12  6.343215 50.25456  4.996944
8675         78 2017-05-02 02:00:23  6.342139 50.25576  5.003056
8676         78 2017-05-02 07:00:47  6.352676 50.25308  5.006667

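To me, this means that for id 74 we would remove rows 8665 and 8667, and for id 78 we would remove rows 8671, 8673, and 8675. Doing this per animals_id leaves all remaining observations at least 6 hours apart.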
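The hours column in the table above can be reproduced with something like the following. This is only a sketch on a five-row excerpt of the question's timestamps; the vector names `ids`, `times` and `hours` are mine, not from the original post.

```r
# Five timestamps copied from the question: three for animal 74, two for animal 78.
ids <- c(74L, 74L, 74L, 78L, 78L)
times <- as.POSIXct(
  c("2018-02-17 03:00:24", "2018-02-17 13:00:48", "2018-02-17 18:00:24",
    "2017-05-01 01:00:08", "2017-05-01 06:00:23"),
  tz = "UTC"
)

# Hours since the previous fix of the same animal (0 for each animal's first fix).
# Working on numeric seconds keeps ave() happy and avoids difftime unit surprises.
hours <- ave(as.numeric(times), ids, FUN = function(s) c(0, diff(s)) / 3600)

data.frame(animals_id = ids, acquisition_time = times, hours = hours)
```

Note that simply keeping rows with hours >= 6 would not give the desired result: for id 78 every consecutive gap is about 5 hours, so that filter would keep only the first fix. Each candidate row has to be compared with the last row that was kept, not with the previous row.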

base R

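func <- function(z, period = 6*3600) {
  # Work in plain seconds (z is assumed sorted ascending), so the comparison
  # with `period` does not depend on the units POSIXct subtraction picks.
  z <- as.numeric(z)
  if (length(z) < 2) return(rep(TRUE, length(z)))
  out <- TRUE   # always keep the first observation
  ind <- 1      # index of the most recently kept observation
  while (ind < length(z)) {
    # offsets (relative to ind) of later observations at least `period` away
    found <- which( (z[-seq_len(ind)] - z[ind]) >= period )
    if (!length(found)) {
      # nothing far enough away remains: drop everything that is left
      out <- c(out, rep(FALSE, length(z) - length(out)))
      break
    }
    # drop the observations in between, keep the first one far enough away
    out <- c(out, rep(FALSE, found[1] - 1), TRUE)
    ind <- ind + found[1]
  }
  out
}
dat[ave(as.numeric(dat$acquisition_time, units = "sec"), dat$animals_id, FUN = func) > 0,]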
#      animals_id    acquisition_time longitude latitude
# 8663         74 2018-02-17 03:00:24  6.426237 50.31815
# 8664         74 2018-02-17 13:00:48  6.428196 50.31657
# 8666         74 2018-02-18 14:00:24  6.420372 50.31563
# 8668         74 2018-02-19 00:00:24  6.415756 50.31993
# 8669         74 2018-02-19 20:00:24  6.415771 50.31927
# 8670         78 2017-05-01 01:00:08  6.337308 50.26133
# 8672         78 2017-05-01 11:00:41  6.345818 50.25295
# 8674         78 2017-05-01 21:00:12  6.343215 50.25456
# 8676         78 2017-05-02 07:00:47  6.352676 50.25308

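(Note: base R's ave has one major limitation: the return value of the supplied FUN must be of the same class as the input vector, which causes problems when the input is POSIXt. To get around this, I preemptively convert the times to numeric for the call to ave. Not all of base R's grouping functions need this, only ave, though it is the one best suited to this purpose.)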
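To make the ave caveat concrete, here is a small illustration on made-up values (`x`, `g`, `flag` and `flag2` are hypothetical names used only for this demo). It shows why the result is compared with `> 0`, and one possible alternative with split/unsplit that keeps the logical type.

```r
x <- c(10, 20, 30, 40)
g <- c("a", "a", "b", "b")

# FUN returns a logical vector, but ave() writes it back into a copy of the
# numeric input, so the result comes out as numeric 0/1.
flag <- ave(x, g, FUN = function(v) v == max(v))
flag        # numeric, not logical
flag > 0    # convert back to logical for row indexing

# Alternative: split/lapply/unsplit returns whatever type FUN produces.
flag2 <- unsplit(lapply(split(x, g), function(v) v == max(v)), g)
flag2       # logical
```

Either form can be used to index rows; ave is just the shorter one to write.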

dplyr

library(dplyr)
dat %>%
  group_by(animals_id) %>%
  # pass the times as plain seconds, since func's period is given in seconds
  filter(func(as.numeric(acquisition_time))) %>%
  # not necessary, just here to show the resulting hours-between-times
  mutate(hours = c(0, diff(acquisition_time, units = "hours"))) %>%
  ungroup()
# # A tibble: 9 x 5
#   animals_id acquisition_time    longitude latitude hours
#        <int> <dttm>                  <dbl>    <dbl> <dbl>
# 1         74 2018-02-17 03:00:24      6.43     50.3  0   
# 2         74 2018-02-17 13:00:48      6.43     50.3 10.0 
# 3         74 2018-02-18 14:00:24      6.42     50.3 25.0 
# 4         74 2018-02-19 00:00:24      6.42     50.3 10   
# 5         74 2018-02-19 20:00:24      6.42     50.3 20   
# 6         78 2017-05-01 01:00:08      6.34     50.3  0   
# 7         78 2017-05-01 11:00:41      6.35     50.3 10.0 
# 8         78 2017-05-01 21:00:12      6.34     50.3  9.99
# 9         78 2017-05-02 07:00:47      6.35     50.3 10.0

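(Note that dplyr drops the row names. I added the hours column only to demonstrate the resulting time differences; it is not needed in production.)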
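If you want to double-check a filtered result, one possible sanity test is to compute the smallest gap per animal and confirm it is at least 6 hours. The sketch below uses a few of the kept rows from the output above; the names `kept` and `min_gap_hours` are hypothetical.

```r
# A few of the rows kept by the filter (timestamps copied from the output above).
kept <- data.frame(
  animals_id = c(74L, 74L, 74L, 78L, 78L),
  acquisition_time = as.POSIXct(
    c("2018-02-17 03:00:24", "2018-02-17 13:00:48", "2018-02-18 14:00:24",
      "2017-05-01 01:00:08", "2017-05-01 11:00:41"),
    tz = "UTC"
  )
)

# Smallest gap between consecutive kept fixes, in hours, per animal.
# (An animal with a single row would give Inf plus a warning from min().)
min_gap_hours <- tapply(
  as.numeric(kept$acquisition_time), kept$animals_id,
  function(s) min(diff(s)) / 3600
)

min_gap_hours
all(min_gap_hours >= 6)
```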

Data: to keep things simple (MWE), I used only the first four columns of the data above.

dat <- structure(list(animals_id = c(74L, 74L, 74L, 74L, 74L, 74L, 74L, 78L, 78L, 78L, 78L, 78L, 78L, 78L), acquisition_time = structure(c(1518836424, 1518872448, 1518890424, 1518962424, 1518980454, 1518998424, 1519070424, 1493600408, 1493618423, 1493636441, 1493654423, 1493672412, 1493690423, 1493708447), class = c("POSIXct", "POSIXt"), tzone = "UTC"), longitude = c(6.426237, 6.428196, 6.42394, 6.420372, 6.420273, 6.415756, 6.415771, 6.337308, 6.345836, 6.345818, 6.345813, 6.343215, 6.342139, 6.352676), latitude = c(50.31815, 50.31657, 50.31833, 50.31563, 50.31534, 50.31993, 50.31927, 50.26133, 50.25292, 50.25295, 50.25287, 50.25456, 50.25576, 50.25308 )), row.names = c("8663", "8664", "8665", "8666", "8667", "8668", "8669", "8670", "8671", "8672", "8673", "8674", "8675", "8676"), class = "data.frame")
