I spent a few hours today trying to find a solution. There are similar threads out there, but none is quite what I need.
Dataset:
Year <- c(2019, 2020, 2021, 2019, 2020, 2020, 2021, 2021)
Term <- c("2019_T1", "2020_T1", "2021_T1", "2019_T1", "2020_T1", "2020_T2", "2021_T1", "2021_T2")
Code <- c(1,1,1,2,2,2,2,2)
Description <- c("Desc1","Desc1","Desc1", "Desc2", "Desc2", "Desc2", "Desc2_NotRecent","Desc2_Recent")
df <- data.frame(Year, Term, Code, Description)
This produces the following table:
Year Term Code Description
1 2019 2019_T1 1 Desc1
2 2020 2020_T1 1 Desc1
3 2021 2021_T1 1 Desc1
4 2019 2019_T1 2 Desc2
5 2020 2020_T1 2 Desc2
6 2020 2020_T2 2 Desc2
7 2021 2021_T1 2 Desc2_NotRecent
8 2021 2021_T2 2 Desc2_Recent
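Question: how can I add a column showing the most recent description for each code?
I need to find the most recent one based on Term. Maybe this can be done with a simple sort; sorry, I haven't been able to figure it out.
What matters is that the description comes from the most recent Term value. Here, the most recent Term is 2021_T2. If the first value were picked instead, it could be an outdated description and confuse stakeholders.
The result I need: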
Year Term Code Description Most_Recent
1 2019 2019_T1 1 Desc1 Desc1
2 2020 2020_T1 1 Desc1 Desc1
3 2021 2021_T1 1 Desc1 Desc1
4 2019 2019_T1 2 Desc2 Desc2_Recent
5 2020 2020_T1 2 Desc2 Desc2_Recent
6 2020 2020_T2 2 Desc2 Desc2_Recent
7 2021 2021_T1 2 Desc2_NotRecent Desc2_Recent
8 2021 2021_T2 2 Desc2_Recent Desc2_Recent
Many thanks for any help. Edited to include Robin Gertenbach's simple solution.
df %>%
  group_by(Code) %>%
  dplyr::mutate(Most_Recent = dplyr::last(Description, order_by = Term))
A shorter solution, with no joins or conditionals:
df %>%
  group_by(Code) %>%
  mutate(Most_Recent = last(Description, Term))
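For readers who want to see the same idea without dplyr, here is a minimal base R sketch. It assumes, as in the question's data, that Term sorts correctly as text (four-digit year, single-digit term number). The helper name latest_description is made up for illustration.

```r
df <- data.frame(
  Year = c(2019, 2020, 2021, 2019, 2020, 2020, 2021, 2021),
  Term = c("2019_T1", "2020_T1", "2021_T1", "2019_T1",
           "2020_T1", "2020_T2", "2021_T1", "2021_T2"),
  Code = c(1, 1, 1, 2, 2, 2, 2, 2),
  Description = c("Desc1", "Desc1", "Desc1", "Desc2", "Desc2", "Desc2",
                  "Desc2_NotRecent", "Desc2_Recent"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the Description on the row with the greatest Term.
# method = "radix" sorts strings in the C locale, so the result does not
# depend on the session's collation settings.
latest_description <- function(g) {
  g$Description[order(g$Term, method = "radix")][nrow(g)]
}

# One value per Code, named by the Code ("1", "2")
latest_by_code <- vapply(split(df, df$Code), latest_description, character(1))

# Broadcast each group's value back to every row of that group
df$Most_Recent <- unname(latest_by_code[as.character(df$Code)])
print(df)
```

The lookup by as.character(df$Code) plays the role of the grouped mutate: every row receives the value computed for its own Code.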
library(tidyverse)
Year <- c(2019, 2020, 2021, 2019, 2020, 2020, 2021)
Term <- c("2019_T1", "2020_T1", "2021_T1", "2019_T1", "2020_T1", "2020_T2", "2021_T2")
Code <- c(1, 1, 1, 2, 2, 2, 2)
Description <- c("Desc1", "Desc1", "Desc1", "Desc2", "Desc2", "Desc2", "Desc2_Recent")
df <- data.frame(cbind(Year, Term, Code, Description))
df <- df %>%
  mutate(Year = Year %>% as.integer())
most_recent_descriptions <-
  df %>%
  group_by(Code) %>%
  arrange(-Year) %>%
  slice(1) %>%
  transmute(
    Code,
    Most_Recent = Description
  )
df %>%
  left_join(most_recent_descriptions)
#> Joining, by = "Code"
#> Year Term Code Description Most_Recent
#> 1 2019 2019_T1 1 Desc1 Desc1
#> 2 2020 2020_T1 1 Desc1 Desc1
#> 3 2021 2021_T1 1 Desc1 Desc1
#> 4 2019 2019_T1 2 Desc2 Desc2_Recent
#> 5 2020 2020_T1 2 Desc2 Desc2_Recent
#> 6 2020 2020_T2 2 Desc2 Desc2_Recent
#> 7 2021 2021_T2 2 Desc2_Recent Desc2_Recent
Created on 2021-09-13 by the reprex package (v2.0.1)
Split the Term value into 'Year' and 'Term', arrange the data by Year and Term, and take the last value for each Code.
library(dplyr)
library(tidyr)
df %>%
  separate(Term, c('Year', 'Term'), sep = '_', convert = TRUE) %>%
  arrange(Code, Year, order(gtools::mixedorder(Term))) %>%
  group_by(Code) %>%
  mutate(Most_Recent = last(Description)) %>%
  ungroup()
# Code Year Term Description Most_Recent
# <dbl> <int> <chr> <chr> <chr>
#1 1 2019 T1 Desc1 Desc1
#2 1 2020 T1 Desc1 Desc1
#3 1 2021 T1 Desc1 Desc1
#4 2 2019 T1 Desc2 Desc2_Recent
#5 2 2020 T1 Desc2 Desc2_Recent
#6 2 2020 T2 Desc2 Desc2_Recent
#7 2 2021 T1 Desc2_NotRecent Desc2_Recent
#8 2 2021 T2 Desc2_Recent Desc2_Recent
If you want the Year and Term values back in a single column, you can add unite to the end of the chain above.
...ungroup %>% unite(Term, Year, Term)
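To make the effect of that unite step concrete, here is a rough base R equivalent on a small made-up data frame (the name parts and its values are illustrative only). It assumes unite's defaults, which join the columns with "_" and drop the inputs.

```r
# Made-up example rows, shaped like the output of the separate() step
parts <- data.frame(
  Code = c(1, 2),
  Year = c(2019L, 2021L),
  Term = c("T1", "T2"),
  stringsAsFactors = FALSE
)

# Paste Year and Term back together with "_" and drop the Year column
parts$Term <- paste(parts$Year, parts$Term, sep = "_")
parts$Year <- NULL
print(parts)
```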
Update after clarification (see comments):
library(tidyverse)
df %>%
  group_by(Code) %>%
  separate(Term, c("Year", "T"), sep = "_", remove = FALSE) %>%
  mutate(T = parse_number(T),
         Most_recent = ifelse(Year == max(Year) & T == max(T), Description, NA)
  ) %>%
  fill(Most_recent, .direction = "up")
Output:
Term Code Year T Description Most_recent
<chr> <chr> <chr> <dbl> <chr> <chr>
1 2019_T1 1 2019 1 Desc1 Desc1
2 2020_T1 1 2020 1 Desc1 Desc1
3 2021_T1 1 2021 1 Desc1 Desc1
4 2019_T1 2 2019 1 Desc2 Desc2_Recent
5 2020_T1 2 2020 1 Desc2 Desc2_Recent
6 2020_T2 2 2020 2 Desc2 Desc2_Recent
7 2021_T1 2 2021 1 Desc2_NotRecent Desc2_Recent
8 2021_T2 2 2021 2 Desc2_Recent Desc2_Recent
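One caveat worth illustrating: the condition Year == max(Year) & T == max(T) only matches when the latest year also holds the group's highest term number. The base R sketch below uses made-up data (df2 and its descriptions are hypothetical) where that is not the case, and picks the latest row through a single numeric key instead. The key year * 100 + term number is an assumption that holds as long as there are fewer than 100 terms per year.

```r
# Hypothetical data: the highest term number (T3) is not in the latest year
df2 <- data.frame(
  Term = c("2020_T1", "2020_T3", "2021_T1"),
  Code = c(2, 2, 2),
  Description = c("Old", "Older_T3", "Newest"),
  stringsAsFactors = FALSE
)

# Pull the numeric year and term number out of strings like "2021_T1"
year    <- as.integer(sub("_T.*$", "", df2$Term))
term_no <- as.integer(sub("^.*_T", "", df2$Term))

# With this data no row satisfies both maxima at once
any(year == max(year) & term_no == max(term_no))

# Single sortable key (assumes fewer than 100 terms per year)
key <- year * 100L + term_no

# For each Code, the row index with the largest key, repeated for every
# row of that group
idx <- ave(seq_along(key), df2$Code, FUN = function(i) i[which.max(key[i])])
df2$Most_Recent <- df2$Description[idx]
print(df2)
```

Because the key is numeric, this also orders a term such as T10 after T2, which plain text sorting of Term would not.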
First answer: here is a tidyverse approach that gives you the result:
library(dplyr)
library(tidyr)
df %>%
  group_by(Code) %>%
  mutate(Year = as.numeric(Year),
         Most_recent = ifelse(Year == max(Year), Description, NA)
  ) %>%
  fill(Most_recent, .direction = "up")
Output:
Year Term Code Description Most_recent
<dbl> <chr> <chr> <chr> <chr>
1 2019 2019_T1 1 Desc1 Desc1
2 2020 2020_T1 1 Desc1 Desc1
3 2021 2021_T1 1 Desc1 Desc1
4 2019 2019_T1 2 Desc2 Desc2_Recent
5 2020 2020_T1 2 Desc2 Desc2_Recent
6 2020 2020_T2 2 Desc2 Desc2_Recent
7 2021 2021_T2 2 Desc2_Recent Desc2_Recent