选择满足以下条件的行:在同一组中的变量的给定差内有其他行

  • 本文关键字:变量 一组 其他 条件 满足 选择 r
  • 更新时间 :
  • 英文 :


我有一个数据集,包含许多个人("ID"(,在跨越15年的随机时间点("时间"(进行体重测量("BW"(。

Example:
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
ID      Time       BW
1   1  2015/1/1 75.01736
2   1  2015/3/1 75.44717
3   1  2016/1/1 73.09934
4   1  2016/3/1 74.79920
5   1  2017/1/1 74.70097
6   1  2018/5/1 74.23496
7   2  2012/1/1 73.57179
8   2  2017/5/1 74.50970
9   2  2019/4/1 74.43412
10  2  2020/4/1 75.02952
11  3 2019/10/1 76.41390
12  3  2020/1/1 75.79827
13  3  2020/4/1 74.46035

我试图过滤的是ID,其测量值在本次测量前12+/-3个月内有一个,在之后有一个。即0岁+/-3个月时一个体重,1岁时一个,2岁+/-3个月。在这种情况下,只有第3行到第5行符合标准。

并且在所有的";个人";满足这些标准,我想选择在+/-15个月范围内拥有最多数据点的测量。示例所需输出可能如下所示:


ID      Time       BW Fulfill Counts
1   1  2015/1/1 75.01736       0      4
2   1  2015/3/1 75.44717       0      4
3   1  2016/1/1 73.09934       1      5
4   1  2016/3/1 74.79920       1      5
5   1  2017/1/1 74.70097       1      3
6   1  2018/5/1 74.23496       0      2
7   2  2012/1/1 73.57179       0      1
8   2  2017/5/1 74.50970       0      1
9   2  2019/4/1 74.43412       0      2
10  2  2020/4/1 75.02952       0      2
11  3 2019/10/1 76.41390       0      3
12  3  2020/1/1 75.79827       0      3
13  3  2020/4/1 74.46035       0      3

我已经尽力在互联网上搜索类似的答案,但我找不到任何接近我想做的事情。我只能用进入分组部分

group_by(ID)%>%
mutate(Fulfill==if time-...)

然后将";计算与每隔一行的差";事情我想象的是,一个组(ID(中的每一行都有一个循环来计算时间差,然后是一个逻辑语句来确定它是否为真。我已经使用R有一段时间了,但以前只使用描述性统计数据,所以如果它真的很简单,我很抱歉。谢谢

这里有一种tidyverse方法(没有完全优化,您甚至可能将其简化为只有一个map_dfr左右的函数调用(。

我选择使用purrr::map_函数。这使我能够将该函数分别应用于列/向量的每个条目(这是第一次将Time传递给map_时的结果(,同时还传递完整的Time列(第二个参数(,以计算筛选操作,查看您在+-15个月内是否有条目。

ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
library(dplyr)
library(purrr)
library(lubridate)
check_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound_1 <- curr_entry %m-% months(15)
lower_bound_2 <- curr_entry %m-% months(9)
upper_bound_1 <- curr_entry %m+% months(9)
upper_bound_2 <- curr_entry %m+% months(15)

# filter the entries that match the time period constraints
entries <- data.frame(entries = entries)
filtered_lower <- entries %>% 
filter(entries >= lower_bound_1 & entries <= lower_bound_2)
filtered_upper <- entries %>% 
filter(entries >= upper_bound_1 & entries <= upper_bound_2)

# check if there is a matching earlier and later entry
if (nrow(filtered_lower) > 0 && nrow(filtered_upper) > 0) {
TRUE
} else {
FALSE
}
}
calculate_number_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound <- curr_entry %m-% months(15)
upper_bound <- curr_entry %m+% months(15)

# filter the matching entries and calculate the number of observations
entries <- data.frame(entries = entries)
entries %>% 
filter(entries >= lower_bound & entries <= upper_bound) %>% 
nrow()
}
df %>% 
group_by(ID) %>% 
mutate(Time = as.Date(Time, format = "%Y/%m/%d"),
Fulfill = map_lgl(Time, check_entries, Time),
Fulfill_ID = sum(Fulfill) > 0,
Counts = map_int(Time, calculate_number_entries, Time))
#> # A tibble: 13 x 6
#> # Groups:   ID [3]
#>    ID    Time          BW Fulfill Fulfill_ID Counts
#>    <chr> <date>     <dbl> <lgl>   <lgl>       <int>
#>  1 1     2015-01-01  75.4 FALSE   TRUE            4
#>  2 1     2015-03-01  74.0 FALSE   TRUE            4
#>  3 1     2016-01-01  74.2 TRUE    TRUE            5
#>  4 1     2016-03-01  74.9 TRUE    TRUE            5
#>  5 1     2017-01-01  75.6 FALSE   TRUE            3
#>  6 1     2018-05-01  73.8 FALSE   TRUE            1
#>  7 2     2012-01-01  75.6 FALSE   FALSE           1
#>  8 2     2017-05-01  75.0 FALSE   FALSE           1
#>  9 2     2019-04-01  74.3 FALSE   FALSE           2
#> 10 2     2020-04-01  74.9 FALSE   FALSE           2
#> 11 3     2019-10-01  75.5 FALSE   FALSE           3
#> 12 3     2020-01-01  75.3 FALSE   FALSE           3
#> 13 3     2020-04-01  76.0 FALSE   FALSE           3

由reprex包(v0.3.0(创建于2020-12-06

请注意,我在第5个条目中发现了不同的结果,您可以检查月份的加法/减法是否符合您的需要,查看lubridate了解更多信息。

最新更新