Nested for loops in R return unexpected results



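Background: I am using nested loops in R to exclude users' overlapping session data from a mobile app. Since I cannot share that data, I am using the flights data frame from the nycflights13 package, in which I was also able to reproduce the problem.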

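Goal: Exclude all flights to the same destination that overlap in time (that is, flights in the air at the same time). Among overlapping flights, always keep the one with the highest flight number (in the original data, this makes sense).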

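Problem: I end up with only 12 flights instead of the expected several thousand. Apart from being painfully inefficient, can you see where it goes wrong?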

Packages needed to reproduce the problem: tidyverse, [nycflights13][1]

Our "solution":

    # data (flights from package nycflights13)
    df_flights <- flights
    df_flights2 <- df_flights %>%
      drop_na() %>%
      filter(carrier == "MQ") %>% # to decrease running time, just a piece of the data
      mutate(
        unique_n_day = as.factor(as.numeric(date(time_hour))), # creating unique number for a day to loop over
        dest = as.factor(dest),
        air_time = air_time * 60) # converting to seconds

    flight_list <- list()
    ## loop
    for (i in levels(df_flights2$dest)) {
      df_dest <- df_flights2[df_flights2$dest == i, ]

      for (d in levels(df_dest$unique_n_day)) {
        df_day <- df_dest[df_dest$unique_n_day == d, ]

        for (n in 1:nrow(df_day)) {

          df <- df_day[df_day$time_hour >= df_day$time_hour[n] &
                         df_day$time_hour <= df_day$time_hour[n] + df_day$air_time[n], ]

          if (nrow(df) >= 1) {
            flight_list[[n]] <- df[which(df$flight == max(df$flight))[1], ]

          } else (d <- "I know it is silly") # max() complains when the df is empty -> if statement
        }
      }
    }
    # unlisting
    fulldata_flight <- do.call("rbind", flight_list)
    # dropping duplicated values
    fulldata_flight_clean <- distinct(fulldata_flight)

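  [1]: https://github.com/hadley/nycflights13

EDIT: If you are looking for a function that returns overlapping intervals, I recommend data.table::foverlaps(), because the solution below does not capture every case perfectly.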
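As a rough illustration of the data.table::foverlaps() suggestion, here is a minimal sketch on made-up data. The tables and column names (sessions, others, id, start, end, other_id, and so on) are hypothetical and not taken from the question; the second table is a renamed copy of the first so that the self-join produces unambiguous column names.

```r
library(data.table)

# Hypothetical sessions with start/end times in seconds (made-up values).
sessions <- data.table(
  id    = c(101L, 102L, 103L, 104L),
  start = c(0L, 50L, 200L, 260L),
  end   = c(100L, 150L, 250L, 300L)
)

# Renamed copy to join against; foverlaps() requires this second table
# to be keyed, with the interval start and end as the last key columns.
others <- copy(sessions)
setnames(others, c("other_id", "other_start", "other_end"))
setkey(others, other_start, other_end)

# All pairs of rows whose intervals overlap in any way.
pairs <- foverlaps(
  sessions, others,
  by.x = c("start", "end"),
  by.y = c("other_start", "other_end"),
  type = "any"
)

# Drop self-matches and keep each overlapping pair only once.
pairs <- pairs[id < other_id]
print(pairs)
```

In this toy data only sessions 101 and 102 overlap, so a single pair remains. Deciding which member of each pair to keep (for example, the higher id) would be a separate step on top of this.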

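You are reusing n as the list index on line 29 of your code:

    flight_list[[n]] <- df[which(df$flight==max(df$flight))[1],]

However, n iterates over the flights for a single day and destination, of which there happen to be at most 14 (line 23):

    for(n in 1:nrow(df_day))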

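This means that results get overwritten. Simply initialise the list index outside the outermost loop and increment it manually after each addition to the list.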
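The overwriting effect can be reproduced on a tiny example. This is only a sketch with made-up values: groups stands in for the destination/day subsets, and overwritten and kept are hypothetical names for the two result lists.

```r
# Toy data: two groups with three results each (values are made up).
groups <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# Buggy pattern: the inner index n restarts at 1 for every group,
# so results from group "b" overwrite those from group "a".
overwritten <- list()
for (g in names(groups)) {
  for (n in seq_along(groups[[g]])) {
    overwritten[[n]] <- groups[[g]][n]
  }
}

# Fixed pattern: a separate counter that is never reset.
kept <- list()
list_id <- 1
for (g in names(groups)) {
  for (n in seq_along(groups[[g]])) {
    kept[[list_id]] <- groups[[g]][n]
    list_id <- list_id + 1
  }
}

length(overwritten)  # 3: only the last group's results survive
length(kept)         # 6: every result is retained
```

The result list in the buggy version can never grow beyond the size of the largest group, which is why the original code ends up with so few rows.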

In addition, you can get rid of one for loop by first identifying all valid destination and day combinations. See the full code below.

    library(tidyverse)
    library(nycflights13)
    library(lubridate)
    # data (flights from package nycflights13)
    df_flights <- flights
    df_flights2 <- df_flights %>%
      drop_na() %>%
      filter(carrier == "MQ") %>% # to decrease running time, just a piece of the data
      mutate(
        unique_n_day = as.factor(as.numeric(date(time_hour))), # creating unique number for a day to loop over
        dest = as.factor(dest),
        air_time = air_time * 60) # converting to seconds

    flight_list <- list()
    list_id <- 1 # Results list iterator
    # Get all valid destination and day combinations
    dest_day_comb <- df_flights2 %>%
      group_by(dest, unique_n_day) %>%
      count() %>%
      ungroup()
    ## loop
    for (i in seq_len(nrow(dest_day_comb))) {
      current_comb <- dest_day_comb[i, ]
      df_day <- df_flights2 %>%
        filter(dest == current_comb$dest,
               unique_n_day == current_comb$unique_n_day)
      # Since we are only iterating over valid combinations, there is no need to check whether df_day has any rows.
      for (n in seq_len(nrow(df_day))) {

        df <- df_day[df_day$time_hour >= df_day$time_hour[n] &
                       df_day$time_hour <= df_day$time_hour[n] + df_day$air_time[n], ]

        flight_list[[list_id]] <- df[which(df$flight == max(df$flight))[1], ] # Using the list_id to add to the results list.
        # Info: This creates duplicates since valid solutions are added nrow(df) times.
        # However, these duplicates are removed later on.
        list_id <- list_id + 1
      }
    }
    # unlisting
    fulldata_flight <- do.call("rbind", flight_list)
    # dropping duplicated values
    fulldata_flight_clean <- distinct(fulldata_flight)
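If keeping a manual counter feels error-prone, the same window rule could also be written with split() and lapply(), so that results are collected per group and bound once at the end. The following is only a sketch on made-up data: toy, pick_per_group, and the columns day, start, and duration are hypothetical stand-ins for the flights columns used above, and it applies the same window condition as the loop rather than a full overlap detection.

```r
# Toy data (made up): start times and durations in seconds.
toy <- data.frame(
  dest     = c("A", "A", "A", "B"),
  day      = c(1, 1, 1, 1),
  start    = c(0, 30, 500, 0),
  duration = c(100, 100, 100, 100),
  flight   = c(10, 20, 30, 40)
)

# For one destination/day group, apply the same window rule as the loop:
# for each row, look at the rows that start inside its window and keep
# the one with the highest flight number.
pick_per_group <- function(g) {
  picks <- lapply(seq_len(nrow(g)), function(n) {
    in_window <- g$start >= g$start[n] &
      g$start <= g$start[n] + g$duration[n]
    w <- g[in_window, ]
    w[which.max(w$flight), ]
  })
  do.call(rbind, picks)
}

# One data frame per destination/day combination that actually occurs.
groups <- split(toy, list(toy$dest, toy$day), drop = TRUE)

# Bind all picks together and drop the duplicates.
result <- unique(do.call(rbind, lapply(groups, pick_per_group)))
rownames(result) <- NULL
print(result)
```

Because each group returns its own data frame, nothing can be overwritten across groups, and there is no index to reset by accident.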
