我试图过滤贷款数据框架,但是每个月报告都重复贷款,如果他们仍然未偿还,或者如果支付了贷款(不能只使用最新的月报告)。我希望通过贷方筛选贷款的唯一到期日,并删除重复项,并仅按报告日期保留最新的数据。下面是一个数据示例:
df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
, Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
,"Lender1","Lender1","Lender2","Lender2")
, Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
, Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
,150000,150000,2750000,2750000))
从示例数据中可以看到,Lender1有2个唯一的到期日。第一个到期日有3笔贷款在2个报告日期重复,第二个到期日有1笔贷款重复。我想删除重复项以保留最新的报告数据。我的希望是得到一个之后看起来像这样的数据帧:
贷款人名称 | 日期。贷款金额 | |
---|---|---|
13129474 |
您需要将Reporting.date
转换为日期格式,无论是在mutate
中(像我一样)还是直接在filter
中。
library(tidyverse)
df %>%
mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
filter(Reporting.date == max(Reporting.date)) %>%
ungroup()
对于arrange
我们也可以这样做
library(dplyr)
library(lubridate)
df %>%
arrange(Lender.name, Date.of.maturity, Loan.amount,
desc(mdy(Reporting.date))) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
slice_head(n = 1) %>%
ungroup
与产出
# A tibble: 5 x 4
Reporting.date Lender.name Date.of.maturity Loan.amount
<chr> <chr> <chr> <dbl>
1 8/31/2021 Lender1 6/20/2025 13129474
2 8/31/2021 Lender1 6/20/2025 14643881
3 8/31/2021 Lender1 6/20/2025 44935677
4 7/31/2021 Lender1 6/30/2022 150000
5 12/31/2020 Lender2 5/15/2024 2750000
Base R选项使用subset
,transform
和ave
-
subset(transform(df, Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')),
Reporting.date == ave(Reporting.date, Lender.name, Date.of.maturity, FUN = max))
# Reporting.date Lender.name Date.of.maturity Loan.amount
#4 2021-08-31 Lender1 6/20/2025 13129474
#5 2021-08-31 Lender1 6/20/2025 14643881
#6 2021-08-31 Lender1 6/20/2025 44935677
#8 2021-07-31 Lender1 6/30/2022 150000
#10 2020-12-31 Lender2 5/15/2024 2750000