在包含交通阻塞的数据帧中,提供了开始日期、结束日期、开始时间、结束时间、道路、方向(右或左)、开始路段和结束路段。一个示例如下:
start date end date start time end time road direction start road section end road section
2015-03-02 2015-03-02 07:29 09:45 A1 R 113.1 125.7
2017-05-04 2017-05-05 23:29 01:45 A10 L 12.5 14.3
另一个大型数据帧(2000多万行)包含2015年至2020年期间所有可能的路段(以20公里为单位)和时间(以小时为单位)组合。一个示例如下:
date time road direction road section
2015-03-02 09:00-10:00 A1 R 100 - 120
2015-03-02 10:00-11:00 A1 R 100 - 120
2015-03-02 09:00-10:00 A1 R 80 - 100
我想在第二个数据帧中添加一个二进制变量,表示在某一路段当时是否有交通堵塞。在本例中,第一行的值为1,因为时间和路段范围都与第一个数据帧的第一行中的交通堵塞重叠。然而,第二行和第三行取值为0,因为时间和路段不重叠。
任何帮助和/或建议都是非常感谢的!事实证明,这是一项相当艰巨的任务,其解决方案涉及相当一些工作。最后,这是一个美好的周日下午的锻炼。让我们希望它适用于您的完整数据集。
ex <- data.frame(start_date = c("2015-03-02", "2017-05-04"),
end_date = c("2015-03-02", "2017-05-05"),
start_time = c("07:29", "23:29"),
end_time = c("09:45", "01:45"),
road = c("A1", "A10"),
direction = c("R", "L"),
start_road_section = c(113.1, 12.5),
end_road_section = c(125.7, 14.3))
df <- data.frame(date = c("2015-03-02", "2015-03-02", "2015-03-02"),
time = c("09:00-10:00", "10:00-11:00", "09:00-10:00"),
road = c("A1", "A1", "A1"),
direction = c("R", "R", "R"),
road_section = c("100 - 120", "100 - 120", "80 - 100"))
第一步:清理数据
df %<>% mutate(time_component = strsplit(as.character(time), "-")) %>%
tidyr::unnest_wider(time_component, names_sep = "_") %>%
select(-time) %>%
mutate(road_component = strsplit(as.character(road_section), " - ")) %>%
tidyr::unnest_wider(road_component, names_sep = "_") %>%
select(-road_section)
> df
# A tibble: 3 x 7
date road direction time_component_1 time_component_2 road_component_1 road_component_2
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2015-03-02 A1 R 09:00 10:00 100 120
2 2015-03-02 A1 R 10:00 11:00 100 120
3 2015-03-02 A1 R 09:00 10:00 80 100
第二步:合并数据
comb_df <- dplyr::inner_join(ex,df, by=c("start_date" = "date", "road", "direction"))
comb_df$start_road_section <- as.numeric(comb_df$start_road_section)
comb_df$end_road_section <- as.numeric(comb_df$end_road_section)
comb_df$road_component_1 <- as.numeric(comb_df$road_component_1)
comb_df$road_component_2 <- as.numeric(comb_df$road_component_2)
第三步:寻找时间与路段的重叠
new_df <- comb_df %>% mutate(ex_start = lubridate::ymd_hm(paste(start_date, start_time, sep = " ")),
ex_end = lubridate::ymd_hm(paste(end_date, end_time, sep = " ")),
df_start = lubridate::ymd_hm(paste(start_date, time_component_1, sep = " ")),
df_end = lubridate::ymd_hm(paste(start_date, time_component_2, sep = " ")),
Intervall_ex = lubridate::interval(start = ex_start, end = ex_end),
Intervall_df = lubridate::interval(start = df_start, end = df_end),
Time_Contained = ifelse(int_overlaps(Intervall_ex, Intervall_df), 1, 0),
Road_Contained = ifelse(((start_road_section < road_component_1) & (end_road_section > road_component_1)) |
((start_road_section > road_component_1) & (start_road_section < road_component_2)),
1,
0),
Final_Result = ifelse(Time_Contained + Road_Contained == 2, 1, 0))
第4步也是最后一步:减少数据集
final_df <- new_df %>% select(start_date, end_date, road,
direction, road_component_1, road_component_2,
time_component_1, time_component_2, Final_Result)
> final_df
start_date end_date road direction road_component_1 road_component_2 time_component_1 time_component_2 Final_Result
1 2015-03-02 2015-03-02 A1 R 100 120 09:00 10:00 1
2 2015-03-02 2015-03-02 A1 R 100 120 10:00 11:00 0
3 2015-03-02 2015-03-02 A1 R 80 100 09:00 10:00 0