根据字符串条件将每小时数据折叠为每日数据并创建虚拟变量



我有一些数据,看起来像:

# A tibble: 6 x 3
Time    Date       Weather             
<chr>   <date>     <chr>               
1 "7:00 " 2010-01-01 Passing clouds      
2 "7:30 " 2010-01-01 Passing clouds      
3 "8:00 " 2010-01-01 Passing clouds      
4 "8:30 " 2010-01-01 Passing clouds      
5 "9:00 " 2010-01-01 Partly sunny        
6 "9:30 " 2010-01-01 Drizzle Partly sunny

其中每天有每小时的数据。我正在尝试将其折叠为每日系列并创建一些虚拟变量,但不是每 30 分钟间隔一次。

也就是说,当我当前创建虚拟变量时,它会创建太多列。这就是为什么我试图根据某些条件折叠它。条件是,如果Weather有 4 个连续的观测值相同,则保持它。即Passing clouds有 4 个连续的Weather条件,但Partly sunny没有,Drizzle Partly sunny也是如此。

我目前有以下内容:

library(splitstackshape)
df %>% 
group_by(Date) %>% 
arrange(Weather) %>% 
distinct(Weather) %>% 
summarise(text = paste(Weather, collapse = "_")) %>% 
cSplit_e(., split.col = "text", sep = "_", type = "character",
mode = "binary", fixed = TRUE, fill = 0)

但这适用于Weather列中的所有独特观察结果,这给了我太多的列。所以我正在尝试添加一个条件,仅在列有 4 个或更多连续观察时才保留列。

数据:

df <- structure(list(Time = c("7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", 
"2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", 
"6:00 ", "6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", "11:00", 
"11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "10:30", "11:00", "11:30", "12:00", "12:30", "1:00 ", 
"1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", 
"5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:05 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 "), Date = structure(c(14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14617), class = "Date"), Weather = c("Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Drizzle Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Drizzle Partly sunny", "Drizzle Partly sunny", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Passing clouds", "Passing clouds", "Light fog", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Broken clouds", "Partly cloudy", "Partly cloudy", "Partly cloudy", 
"Partly cloudy", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Rain Partly sunny", "Rain Partly sunny", "Rain Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", 
"Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Fog", 
"Fog", "Fog", "Fog", "Light rain Fog", "Light rain Fog", "Rain Fog", 
"Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Fog", 
"Partly sunny", "Broken clouds", "Broken clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Light rain Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Partly sunny", 
"Broken clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Broken clouds", "Broken clouds", "Broken clouds", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Scattered clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Partly cloudy", 
"Broken clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Passing clouds", "Passing clouds", 
"Rain Low clouds", "Rain Low clouds", "Rain Low clouds", "Rain Low clouds", 
"Light rain Mostly cloudy", "Light rain Mostly cloudy", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Low clouds", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Mostly cloudy", "Snow Mostly cloudy", 
"Snow Mostly cloudy", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Light snow Ice fog", "Light snow Ice fog", "Ice fog", 
"Passing clouds", "Partly cloudy", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))

您是否正在寻找这样的东西:

library(dplyr)df_new <- df %>% 
group_by(Date) %>% 
mutate(repeated = rep(rle(Weather)$lengths, rle(Weather)$lengths)) %>% 
filter(repeated >= 4)
df_new
#> # A tibble: 148 x 4
#> # Groups:   Date [7]
#>    Time    Date       Weather        repeated
#>    <chr>   <date>     <chr>             <int>
#>  1 "7:00 " 2010-01-01 Passing clouds        4
#>  2 "7:30 " 2010-01-01 Passing clouds        4
#>  3 "8:00 " 2010-01-01 Passing clouds        4
#>  4 "8:30 " 2010-01-01 Passing clouds        4
#>  5 10:00   2010-01-01 Partly sunny         10
#>  6 10:30   2010-01-01 Partly sunny         10
#>  7 11:00   2010-01-01 Partly sunny         10
#>  8 11:30   2010-01-01 Partly sunny         10
#>  9 12:00   2010-01-01 Partly sunny         10
#> 10 12:30   2010-01-01 Partly sunny         10
#> # … with 138 more rows
df_new %>% 
summarise(text = paste(unique(Weather), collapse = "_")) 
#> # A tibble: 7 x 2
#>   Date       text                                                          
#>   <date>     <chr>                                                         
#> 1 2010-01-01 Passing clouds_Partly sunny_Scattered clouds                  
#> 2 2010-01-02 Scattered clouds_Partly sunny_Partly cloudy_Passing clouds    
#> 3 2010-01-03 Passing clouds_Partly sunny                                   
#> 4 2010-01-04 Drizzle Fog_Fog_Rain Fog_Passing clouds                       
#> 5 2010-01-05 Fog_Broken clouds_Partly sunny                                
#> 6 2010-01-06 Scattered clouds                                              
#> 7 2010-01-07 Rain Low clouds_Light rain Mostly cloudy_Snow Ice fog_Passing…

创建于 2019-11-25 由 reprex 软件包 (v0.3.0(

rle对重复的连续值进行计数。我用rep包装它以使其成为mutate调用的正确长度,但您也可以单独运行它以掌握它的工作原理(如果您还不知道(。一旦您知道每个值的重复频率,就很容易先filter然后summarise

最新更新