我有一个数据集,包含许多个人("ID"(,在跨越15年的随机时间点("时间"(进行体重测量("BW"(。
Example:
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
ID Time BW
1 1 2015/1/1 75.01736
2 1 2015/3/1 75.44717
3 1 2016/1/1 73.09934
4 1 2016/3/1 74.79920
5 1 2017/1/1 74.70097
6 1 2018/5/1 74.23496
7 2 2012/1/1 73.57179
8 2 2017/5/1 74.50970
9 2 2019/4/1 74.43412
10 2 2020/4/1 75.02952
11 3 2019/10/1 76.41390
12 3 2020/1/1 75.79827
13 3 2020/4/1 74.46035
我试图过滤的是ID,其测量值在本次测量前12+/-3个月内有一个,在之后有一个。即0岁+/-3个月时一个体重,1岁时一个,2岁+/-3个月。在这种情况下,只有第3行到第5行符合标准。
并且在所有的";个人";满足这些标准,我想选择在+/-15个月范围内拥有最多数据点的测量。示例所需输出可能如下所示:
ID Time BW Fulfill Counts
1 1 2015/1/1 75.01736 0 4
2 1 2015/3/1 75.44717 0 4
3 1 2016/1/1 73.09934 1 5
4 1 2016/3/1 74.79920 1 5
5 1 2017/1/1 74.70097 1 3
6 1 2018/5/1 74.23496 0 2
7 2 2012/1/1 73.57179 0 1
8 2 2017/5/1 74.50970 0 1
9 2 2019/4/1 74.43412 0 2
10 2 2020/4/1 75.02952 0 2
11 3 2019/10/1 76.41390 0 3
12 3 2020/1/1 75.79827 0 3
13 3 2020/4/1 74.46035 0 3
我已经尽力在互联网上搜索类似的答案,但我找不到任何接近我想做的事情。我只能用进入分组部分
group_by(ID)%>%
mutate(Fulfill==if time-...)
然后将";计算与每隔一行的差";事情我想象的是,一个组(ID(中的每一行都有一个循环来计算时间差,然后是一个逻辑语句来确定它是否为真。我已经使用R有一段时间了,但以前只使用描述性统计数据,所以如果它真的很简单,我很抱歉。谢谢
这里有一种tidyverse
方法(没有完全优化,您甚至可能将其简化为只有一个map_dfr
左右的函数调用(。
我选择使用purrr::map_
函数。这使我能够将该函数分别应用于列/向量的每个条目(这是第一次将Time
传递给map_
时的结果(,同时还传递完整的Time
列(第二个参数(,以计算筛选操作,查看您在+-15个月内是否有条目。
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
library(dplyr)
library(purrr)
library(lubridate)
check_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound_1 <- curr_entry %m-% months(15)
lower_bound_2 <- curr_entry %m-% months(9)
upper_bound_1 <- curr_entry %m+% months(9)
upper_bound_2 <- curr_entry %m+% months(15)
# filter the entries that match the time period constraints
entries <- data.frame(entries = entries)
filtered_lower <- entries %>%
filter(entries >= lower_bound_1 & entries <= lower_bound_2)
filtered_upper <- entries %>%
filter(entries >= upper_bound_1 & entries <= upper_bound_2)
# check if there is a matching earlier and later entry
if (nrow(filtered_lower) > 0 && nrow(filtered_upper) > 0) {
TRUE
} else {
FALSE
}
}
calculate_number_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound <- curr_entry %m-% months(15)
upper_bound <- curr_entry %m+% months(15)
# filter the matching entries and calculate the number of observations
entries <- data.frame(entries = entries)
entries %>%
filter(entries >= lower_bound & entries <= upper_bound) %>%
nrow()
}
df %>%
group_by(ID) %>%
mutate(Time = as.Date(Time, format = "%Y/%m/%d"),
Fulfill = map_lgl(Time, check_entries, Time),
Fulfill_ID = sum(Fulfill) > 0,
Counts = map_int(Time, calculate_number_entries, Time))
#> # A tibble: 13 x 6
#> # Groups: ID [3]
#> ID Time BW Fulfill Fulfill_ID Counts
#> <chr> <date> <dbl> <lgl> <lgl> <int>
#> 1 1 2015-01-01 75.4 FALSE TRUE 4
#> 2 1 2015-03-01 74.0 FALSE TRUE 4
#> 3 1 2016-01-01 74.2 TRUE TRUE 5
#> 4 1 2016-03-01 74.9 TRUE TRUE 5
#> 5 1 2017-01-01 75.6 FALSE TRUE 3
#> 6 1 2018-05-01 73.8 FALSE TRUE 1
#> 7 2 2012-01-01 75.6 FALSE FALSE 1
#> 8 2 2017-05-01 75.0 FALSE FALSE 1
#> 9 2 2019-04-01 74.3 FALSE FALSE 2
#> 10 2 2020-04-01 74.9 FALSE FALSE 2
#> 11 3 2019-10-01 75.5 FALSE FALSE 3
#> 12 3 2020-01-01 75.3 FALSE FALSE 3
#> 13 3 2020-04-01 76.0 FALSE FALSE 3
由reprex包(v0.3.0(创建于2020-12-06
请注意,我在第5个条目中发现了不同的结果,您可以检查月份的加法/减法是否符合您的需要,查看lubridate了解更多信息。