GR1 A1 2022-01-01 2022-02-12 GR2 A2 2022-01-02
Reproducible data:
library(dplyr)  # provides tibble() and %>%
df1 <- tibble(id = c("GR1", "GR2"),
              area = c("A1", "A2"),
              date1 = as.Date(c("2022-01-01", "2022-01-02")),
              date2 = as.Date(c("2022-01-06", "2022-01-08")))
set.seed(543)
df2 <- tibble(date3 = seq(as.Date("2022-01-01"), as.Date("2022-01-09"), "days"),
              temperature = runif(9, min = 28, max = 33),
              area = c("A1", "A2", "A1", "A2", "A1", "A2", "A1", "A2", "A1"))
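Hi, I want to create a column in df1 holding the average temperature, computed by filtering df2 to the matching area and date range. (In the real data frames, df1 has 1036 rows and df2 has 26192 rows.)
I tried this approach, but it does not work the way I expected: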
df3 <- df1 %>%
  group_by(area) %>%
  mutate(average_temp = mean(filter(.data = df2, date3 >= df1$date1 & date3 <= df1$date2 & area == df1$area)$temperature))
I get this warning:
Warning message:
1: Problem while computing `average_temp = mean(...)`.
i longer object length is not a multiple of shorter object length
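The expected result is:

This is a non-equi (range-based) join. Unfortunately, dplyr versions before 1.1.0 cannot do this by themselves, so we need help from another package. Some options: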
fuzzyjoin
fuzzyjoin::fuzzy_left_join(
  df1, df2,
  by = c("area", date1 = "date3", date2 = "date3"),
  match_fun = list(`==`, `<=`, `>=`)
) %>%
  group_by(id, date1, date2) %>%
  summarize(
    area = area.x[1],
    avg = mean(temperature)
  ) %>%
  ungroup()
# `summarise()` has grouped output by 'id', 'date1'. You can override using the `.groups` argument.
# # A tibble: 2 x 5
# id date1 date2 area avg
# <chr> <date> <date> <chr> <dbl>
# 1 GR1 2022-01-01 2022-01-06 A1 31.6
# 2 GR2 2022-01-02 2022-01-08 A2 30.5
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT1[DT2, avg := ave(i.temperature, id, FUN = mean),
    on = .(area, date1 <= date3, date2 >= date3)]
# id area date1 date2 avg
# <char> <char> <Date> <Date> <num>
# 1: GR1 A1 2022-01-01 2022-01-06 31.58708
# 2: GR2 A2 2022-01-02 2022-01-08 30.50867
(I know there is a more canonical way to do this without `ave`, but I've run out of time ...)
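One form that avoids `ave` could look like the sketch below. It is not the answer author's code: it joins in the other direction (`DT2[DT1, ...]`) and uses `by = .EACHI` so that `j` is evaluated once per row of `DT1`. The toy data is rebuilt from the question so the block runs on its own, and `per_row` is a hypothetical helper name.

```r
library(data.table)

# Same toy data as in the question, rebuilt here so the block is self-contained.
set.seed(543)
DT1 <- data.table(
  id    = c("GR1", "GR2"),
  area  = c("A1", "A2"),
  date1 = as.Date(c("2022-01-01", "2022-01-02")),
  date2 = as.Date(c("2022-01-06", "2022-01-08"))
)
DT2 <- data.table(
  date3       = seq(as.Date("2022-01-01"), as.Date("2022-01-09"), by = "days"),
  temperature = runif(9, min = 28, max = 33),
  area        = rep(c("A1", "A2"), length.out = 9)
)

# For each row of DT1 (by = .EACHI), average the DT2 temperatures whose
# date3 lies in [date1, date2] and whose area matches.
per_row <- DT2[DT1,
               .(avg = mean(temperature)),
               on = .(area, date3 >= date1, date3 <= date2),
               by = .EACHI]

# per_row has one row per DT1 row, in DT1's order, so it can be assigned back.
DT1[, avg := per_row$avg]
DT1
```

Rows of `DT1` with no match in `DT2` would get `NA` here, since the join keeps unmatched rows by default.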
sqldf
# library(sqldf) # not required to load, per se
sqldf::sqldf(
  "select df1.id, df1.area, df1.date1, df1.date2,
          avg(df2.temperature) as avg
   from df1
     left join df2 on df1.area=df2.area
       and df2.date3 between df1.date1 and df1.date2
   group by df1.id, df1.area, df1.date1, df1.date2")
# id area date1 date2 avg
# 1 GR1 A1 2022-01-01 2022-01-06 31.58708
# 2 GR2 A2 2022-01-02 2022-01-08 30.50867
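dplyr (1.1.0 or later)

Newer dplyr releases can express the same range condition directly, so a further option is sketched here. It assumes dplyr >= 1.1.0, which introduced `join_by()` and the `.by` argument of `summarize()`; `res` is a hypothetical name, and the data is rebuilt from the question so the block runs on its own.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by() and .by

# Same toy data as in the question.
df1 <- tibble(
  id    = c("GR1", "GR2"),
  area  = c("A1", "A2"),
  date1 = as.Date(c("2022-01-01", "2022-01-02")),
  date2 = as.Date(c("2022-01-06", "2022-01-08"))
)
set.seed(543)
df2 <- tibble(
  date3       = seq(as.Date("2022-01-01"), as.Date("2022-01-09"), by = "days"),
  temperature = runif(9, min = 28, max = 33),
  area        = rep(c("A1", "A2"), length.out = 9)
)

# Equality on area plus two inequality conditions: in join_by(), the left side
# of each expression refers to df1 and the right side to df2.
res <- df1 %>%
  left_join(df2, by = join_by(area, date1 <= date3, date2 >= date3)) %>%
  summarize(avg = mean(temperature), .by = c(id, area, date1, date2))
res
```

As with the other options, the join produces one row per matching (df1, df2) pair before summarizing, so on large inputs the intermediate table can be much bigger than df1.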