r - Create a dummy for the occurrence of a group-specific maximum, conditional on the overall maximum

Suppose I have a dataset like this:

library(tidyverse)
library(lubridate)

state <- c(rep("Alabama", 10), rep("Arizona", 10), rep("Arkansas", 10))
county <- c(rep("Baldwin", 5), rep("Barbour", 5), rep("Apache", 5),
            rep("Cochise", 5), rep("Arkansas", 5), rep("Ashley", 5))
# Five consecutive days, repeated for each of the six counties
date <- rep(seq(ymd("2012-04-06"), ymd("2012-04-10"), by = "days"), 6)
stray_dogs <- c(lag(1:3, n = 2, default = 0), floor(runif(7, min = 1, max = 4)),
                lag(1:6, n = 5, default = 0), floor(runif(4, min = 1, max = 18)),
                lag(1:2, n = 1, default = 0), floor(runif(8, min = 1, max = 4)))

df <- data.frame(state, county, date, stray_dogs) %>%
  # Overall (ungrouped) maximum across all rows
  mutate(stray_dogs_max = max(stray_dogs)) %>%
  # 1 on the row(s) holding the overall maximum, 0 elsewhere
  mutate(most_stray_dogs = case_when(stray_dogs_max == stray_dogs ~ 1,
                                     stray_dogs_max != stray_dogs ~ 0))

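I would like to find, via group_by(state, county) or something similar, the date on which the most stray dogs were found in each county, and create a dichotomous variable (column) that takes the value 1 on that date and 0 on all other dates. However, when a county had no stray dogs at all during the period, it should mark as 1 the day on which most_stray_dogs equals 1; and when a county has several days tied for the highest count, it should pick the day closest to the one where most_stray_dogs == 1.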

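For the latter point, my intuition is to use a helper vector created with difftime; even so, I have not managed to put all of this together. How should I create this column?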

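I think this works. No "right answer" is provided, and the data is big enough that it's hard to eyeball, so I'm not optimistic, but it is methodical, so it should at least get you on the right track.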

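When calculating the date difference, I arbitrarily subtract 0.1 to act as a tie-breaker between days that are equally far before and after the national maximum. Then I arrange within each group to assign the best choice (this is a little inefficient, but should be fast enough).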
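To see what the offset does in isolation, here is a small sketch in base R. The dates and the national-maximum date (`national_max_date`) are made-up values for illustration, not taken from the data above.

```r
# Hypothetical dates around an assumed national-maximum date
dates <- as.Date(c("2012-04-07", "2012-04-08", "2012-04-09"))
national_max_date <- as.Date("2012-04-08")

# Plain distance in days: the day before and the day after tie
plain_dist <- abs(as.numeric(dates - national_max_date))

# Shifted distance: subtracting 0.1 makes the later day slightly closer
shifted_dist <- abs(as.numeric(dates - national_max_date) - 0.1)

print(plain_dist)
print(shifted_dist)
```

With the shift, the national-maximum date itself still has the smallest distance, and of the two neighboring days the later one now ranks ahead of the earlier one.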

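df %>%
  arrange(state, county, date) %>%
  # Total number of dogs found across all counties on each date
  group_by(date) %>%
  mutate(national_count = sum(stray_dogs)) %>%
  ungroup() %>%
  mutate(
    is_national_max = national_count == max(national_count)
  ) %>%
  group_by(state, county) %>%
  mutate(
    is_county_max = stray_dogs == max(stray_dogs),
    # Distance in days from the national-max date; the 0.1 offset breaks ties
    # in favor of the later day. Assumes a single date holds the national max.
    days_from_national_max = abs(date - date[is_national_max] - 0.1)
  ) %>%
  # County-max rows first, then the smallest distance (closest day) first
  arrange(state, county, desc(is_county_max), days_from_national_max) %>%
  mutate(your_result = as.integer(row_number() == 1)) %>%
  ungroup() %>%
  arrange(state, county, date)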
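Because the example data is random, the result is hard to verify by eye. The following sketch runs the same steps on a tiny hand-made data frame where the expected flags can be worked out manually. The data frame `toy`, its values, and the column name `flag` are assumptions made up for this check; it sorts by ascending distance so that the closest day comes first.

```r
library(dplyr)

# Made-up data: county "A" has a tie for its maximum, county "B" has no
# dogs at all, and county "C" has a single clear maximum
toy <- data.frame(
  county = rep(c("A", "B", "C"), each = 3),
  date = rep(as.Date("2012-04-06") + 0:2, times = 3),
  stray_dogs = c(2, 1, 2, 0, 0, 0, 1, 9, 1)
)

result <- toy %>%
  # Total across counties per date, then flag the date with the largest total
  group_by(date) %>%
  mutate(national_count = sum(stray_dogs)) %>%
  ungroup() %>%
  mutate(is_national_max = national_count == max(national_count)) %>%
  group_by(county) %>%
  mutate(
    is_county_max = stray_dogs == max(stray_dogs),
    # Offset distance in days; assumes exactly one national-max date
    days_from_national_max =
      abs(as.numeric(date - date[is_national_max]) - 0.1)
  ) %>%
  # County-max rows first, closest to the national max first
  arrange(county, desc(is_county_max), days_from_national_max) %>%
  mutate(flag = as.integer(row_number() == 1)) %>%
  ungroup() %>%
  arrange(county, date)

print(result[, c("county", "date", "stray_dogs", "flag")])
```

In this toy data the national maximum falls on 2012-04-07. County A ties on the first and third day, so the later day (2012-04-08) is flagged; county B has no dogs, so the national-maximum date is flagged; county C is flagged on its own maximum.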
