我在识别重叠日期时遇到问题。我有一个非常大的数据集,包含大量的重复数据。这是的样本
A tibble: 20 x 9 (df)
pat_id praktijk_id Geslacht Geboortedatum Inschrijfdatum Uitschrijfdatum inschrijf_first inschrijf_last Dup_N
<int> <int> <chr> <chr> <date> <date> <date> <date> <int>
1 17895 5 V 2013-09-01 2013-09-24 2022-01-01 2013-10-01 2019-04-01 19
2 303260 29 V 2013-09-01 2019-08-26 2022-01-01 2019-10-01 2020-10-01 19
3 175084 53 M 1995-01-01 2019-08-28 2022-01-01 2019-10-01 2020-01-01 501
4 14050 51 M 1995-01-01 2020-04-14 2022-01-01 2020-07-01 2020-10-01 501
5 418695 47 M 1986-11-01 2001-06-29 2022-01-01 2014-07-01 2020-07-01 557
6 242117 22 M 1986-11-01 2020-09-22 2022-01-01 2020-10-01 NA 557
7 310296 24 M 1967-10-01 2013-08-30 2022-01-01 2014-01-01 2020-10-01 661
8 17275 61 M 1967-10-01 2012-01-11 2013-09-01 2013-01-01 2013-07-01 661
9 121166 5 M 1991-07-01 1999-09-01 2015-03-31 2006-01-01 2015-01-01 975
10 113649 27 M 1991-07-01 2018-04-18 2022-01-01 2018-07-01 2020-10-01 975
11 149016 66 V 1985-01-01 2015-05-12 2016-02-29 2015-07-01 2016-01-01 1067
12 47918 40 V 1985-01-01 2016-09-02 2022-01-01 2016-10-01 2020-10-01 1067
13 423021 44 V 1992-07-01 2014-03-03 2014-05-27 2014-04-01 NA 1079
14 31504 44 V 1992-07-01 2014-04-11 2016-02-18 2014-07-01 2016-01-01 1079
15 194200 4 M 1969-10-01 1996-09-30 2022-01-01 2006-01-01 2016-04-01 1195
16 261328 1 M 1969-10-01 2015-10-21 2016-01-19 2012-07-01 2016-01-01 1195
17 131040 39 V 1993-07-01 2006-09-06 2015-01-30 2014-01-01 2015-01-01 1352
18 321094 40 V 1993-07-01 2020-09-15 2022-01-01 2020-10-01 NA 1352
19 494575 39 V 1996-01-01 2013-10-23 2015-12-10 2014-01-01 2015-10-01 1355
20 58560 38 V 1996-01-01 2015-12-09 2022-01-01 2016-01-01 2020-10-01 1355
df <- structure(list(pat_id = c(17895L, 303260L, 175084L, 14050L, 418695L,
242117L, 310296L, 17275L, 121166L, 113649L, 149016L, 47918L,
423021L, 31504L, 194200L, 261328L, 131040L, 321094L, 494575L,
58560L), praktijk_id = c(5L, 29L, 53L, 51L, 47L, 22L, 24L, 61L,
5L, 27L, 66L, 40L, 44L, 44L, 4L, 1L, 39L, 40L, 39L, 38L), Geslacht = c("V",
"V", "M", "M", "M", "M", "M", "M", "M", "M", "V", "V", "V", "V",
"M", "M", "V", "V", "V", "V"), Geboortedatum = c("2013-09-01",
"2013-09-01", "1995-01-01", "1995-01-01", "1986-11-01", "1986-11-01",
"1967-10-01", "1967-10-01", "1991-07-01", "1991-07-01", "1985-01-01",
"1985-01-01", "1992-07-01", "1992-07-01", "1969-10-01", "1969-10-01",
"1993-07-01", "1993-07-01", "1996-01-01", "1996-01-01"), Inschrijfdatum = structure(c(15972,
18134, 18136, 18366, 11502, 18527, 15947, 15350, 10835, 17639,
16567, 17046, 16132, 16171, 9769, 16729, 13397, 18520, 16001,
16778), class = "Date"), Uitschrijfdatum = structure(c(18993,
18993, 18993, 18993, 18993, 18993, 18993, 15949, 16525, 18993,
16860, 18993, 16217, 16849, 18993, 16819, 16465, 18993, 16779,
18993), class = "Date"), inschrijf_first = structure(c(15979,
18170, 18170, 18444, 16252, 18536, 16071, 15706, 13149, 17713,
16617, 17075, 16161, 16252, 13149, 15522, 16071, 18536, 16071,
16801), class = "Date"), inschrijf_last = structure(c(17987,
18536, 18262, 18536, 18444, NA, 18536, 15887, 16436, 18536, 16801,
18536, NA, 16801, 16892, 16801, 16436, NA, 16709, 18536), class = "Date"),
Dup_N = c(19L, 19L, 501L, 501L, 557L, 557L, 661L, 661L, 975L,
975L, 1067L, 1067L, 1079L, 1079L, 1195L, 1195L, 1352L, 1352L,
1355L, 1355L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
我想在chrijf_first&每个重复的Dup_N的inschrijf_tlast。
|---- DateRange Dup_N duplicate (1) -----| _
_ |---Date Range Dup_N duplicate (2) ----|
_ |---Date Range Dup_N duplicate (3) ----|
如果日期"接触",例如,对于重复的第一行inschrijf_first=inschrijf _flast,则不视为重叠。(见下文(
|---- DateRange Dup_N duplicate (1) -----| _
_ |---Date Range Dup_N duplicate (2) ----|
预期的输出将是一个添加的列,标记重复的重复数据行以进行重叠。
我尝试了以下代码:它标记重叠的时间段。然而,它也标记了"感人"的日期(如上所述(。
sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from df a
left join df b on a.Dup_N = b.Dup_N and
not a.rowid = b.rowid and
((a.inschrijf_first between b.inschrijf_first and b.inschrijf_last) or
(b.inschrijf_first between a.inschrijf_first and a.inschrijf_last))
group by a.rowid
order by a.rowid")
在标记具有重叠时间段的重复项之后,我想将没有重叠的重复项_N组合起来。
谨致问候!
使用tidyverse
,可以得到一个带有标记重叠行的列(只显示最后3列(。
一些备注:
- 这只适用于2个重复项(2行(,因为它没有定义当一个重叠而另一个不重叠时该怎么办(3+行(
- 没有定义范围(只有一个日期(。现在它正在返回
NA
- 目前还不清楚如何";"组合";非重叠行
library(dplyr)
library(tidyr)
df %>%
group_by(Dup_N) %>%
mutate( overlap=(lead(inschrijf_first)>inschrijf_first &
lead(inschrijf_first)<inschrijf_last)|
(lead(inschrijf_last)>inschrijf_first &
lead(inschrijf_last)<inschrijf_last) ) %>%
fill(overlap, .direction="downup") %>%
ungroup() %>%
print(width=90)
# A tibble: 20 × 10
... inschrijf_first inschrijf_last Dup_N overlap
... <date> <date> <int> <lgl>
... 1 2013-10-01 2019-04-01 19 FALSE
... 2 2019-10-01 2020-10-01 19 FALSE
... 3 2019-10-01 2020-01-01 501 FALSE
... 4 2020-07-01 2020-10-01 501 FALSE
... 5 2014-07-01 2020-07-01 557 NA
... 6 2020-10-01 NA 557 NA
... 7 2014-01-01 2020-10-01 661 FALSE
... 8 2013-01-01 2013-07-01 661 FALSE
... 9 2006-01-01 2015-01-01 975 FALSE
... 10 2018-07-01 2020-10-01 975 FALSE
... 11 2015-07-01 2016-01-01 1067 FALSE
... 12 2016-10-01 2020-10-01 1067 FALSE
... 13 2014-04-01 NA 1079 NA
... 14 2014-07-01 2016-01-01 1079 NA
... 15 2006-01-01 2016-04-01 1195 TRUE
... 16 2012-07-01 2016-01-01 1195 TRUE
... 17 2014-01-01 2015-01-01 1352 NA
... 18 2020-10-01 NA 1352 NA
... 19 2014-01-01 2015-10-01 1355 FALSE
... 20 2016-01-01 2020-10-01 1355 FALSE