R: data is at irregular minute intervals and needs to be averaged to hourly values



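I am new to both R and programming.

I have a dataset of more than 2 million rows, which I have split into 3 separate CSV files.

The current CSV looks like this (I have removed some rows for simplicity):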

Date        Time        Elevation
1   2011-01-01  0:00:00     3.532   
2   2011-01-01  0:15:00     3.538   
3   2011-01-01  0:30:00     3.541   
4   2011-01-01  0:45:00     3.546   
5   2011-01-01  1:00:00     3.548   
6   2011-01-01  1:15:00     3.555   
7   2011-01-01  1:30:00     3.561   
8   2011-01-01  1:45:00     3.566
9   2011-01-01  2:00:00     3.568
......
Date        Time        Elevation
484985  2018-04-21  0:00:00     3.15264 
484986  2018-04-21  0:05:00     3.15164 
484987  2018-04-21  0:10:00     3.14964     
484988  2018-04-21  0:15:00     3.14563 
484989  2018-04-21  0:20:00     3.14263
484990  2018-04-21  0:25:00     3.13962 

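The time intervals are inconsistent: some of the data were measured at 15-minute intervals and some at 5-minute intervals. I would like output like this: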

Date        Time        Elevation
1      2011-01-01  0:00:00          
2      2011-01-01  1:00:00        3.5393
3      2011-01-01  2:00:00        3.5575
4      2011-01-01  3:00:00        3.5788

I tried this from an old post, but it does not work: it only took the mean over the whole day rather than per hour:

library(lubridate)
df[,2] <- ymd_hms(df[,2])
df %>% mutate(hour = hour(df[,2])) %>% 
group_by(hour) %>% summarise_at(vars(Elevation), mean)

I now have this, which splits the time into separate hour, minute, and second columns:

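library(readr)
library(tidyverse)
library(lubridate)  # for ymd_hms(); older tidyverse versions do not attach lubridate
# Note: the object name read_csv shadows readr::read_csv()
read_csv <- read.csv('dataset1.csv') %>% 
mutate(date_time = ymd_hms(paste0(Date, " ", Time)))
df1 <- read_csv %>% 
separate(Time, into = c("Hour", "Min", "sec"), sep = ":", remove = FALSE)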

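I have spent a long time trying to figure this out and have gone through similar posts here, but nothing has worked so far. Any help would be much appreciated!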

If you are happy with a simple mean of the elevations, cut the date/time column into 1-hour bins and let group_by handle the rest:

df <- structure(list(Date = c("2011-01-01", "2011-01-01", "2011-01-01", 
"2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", 
"2011-01-01", "2018-04-21", "2018-04-21", "2018-04-21", "2018-04-21", 
"2018-04-21", "2018-04-21"), Time = c("0:00:00", "0:15:00", "0:30:00", 
"0:45:00", "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00", 
"0:00:00", "0:05:00", "0:10:00", "0:15:00", "0:20:00", "0:25:00"
), Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561, 
3.566, 3.568, 3.15264, 3.15164, 3.14964, 3.14563, 3.14263, 3.13962
)), class = "data.frame", row.names = c(NA, -15L))
# Create the datetime column from the Date and Time strings
df$datetime <- as.POSIXct(paste(df$Date, df$Time))
# Group the rows into 1-hour bins and take the mean of each bin
library(dplyr)
df %>% group_by(hour = cut(datetime, "1 hour")) %>%
summarize(Elevation = mean(Elevation))

With 2 million rows this will take some time, but it should not be too long.
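To make the binning step concrete, here is a minimal base R sketch of the same idea: each timestamp is rounded down to the start of its hour by dropping the remainder of seconds modulo 3600, and the readings are then averaged per bin. The variable names `hour_start` and `hourly` and the choice of UTC as time zone are assumptions made for illustration; the values are a subset of the sample rows from the question.

```r
# Subset of the question's sample rows, mixing 15-minute and 5-minute readings
df <- data.frame(
  Date = c(rep("2011-01-01", 5), rep("2018-04-21", 3)),
  Time = c("0:00:00", "0:15:00", "0:30:00", "0:45:00", "1:00:00",
           "0:00:00", "0:05:00", "0:10:00"),
  Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548,
                3.15264, 3.15164, 3.14964)
)

# Parse to POSIXct; UTC is an assumption, use the logger's real time zone
df$datetime <- as.POSIXct(paste(df$Date, df$Time),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Round each timestamp down to the start of its hour (3600 seconds per hour)
secs <- as.numeric(df$datetime)
df$hour_start <- as.POSIXct(secs - secs %% 3600,
                            origin = "1970-01-01", tz = "UTC")

# One mean per hourly bin; every reading in a bin carries equal weight
hourly <- aggregate(Elevation ~ hour_start, data = df, FUN = mean)
print(hourly)
```

Because this is a plain mean, an hour sampled every 5 minutes simply contributes more readings to its own bin than an hour sampled every 15 minutes; bins are never mixed, so the differing sampling rates do not bias one hour against another.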

library(tibble)
library(dplyr)
library(lubridate)
tibble::tribble(
~Date,      ~Time, ~Elevation,
"2011-01-01",  "0:00:00",  3.532,
"2011-01-01",  "0:15:00",  3.538,
"2011-01-01",  "0:30:00",  3.541,
"2011-01-01",  "0:45:00",  3.546,
"2011-01-01",  "1:00:00",  3.548,
"2011-01-01",  "1:15:00",  3.555,
"2011-01-01",  "1:30:00",  3.561,
"2011-01-01",  "1:45:00",  3.566,
"2011-01-01",  "2:00:00",  3.568
) %>% 
dplyr::mutate(
Date = lubridate::date(Date),
Time = lubridate::hms(Time),
Hour = lubridate::hour(Time)) %>% 
dplyr::group_by(Date, Hour) %>% 
dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE))
# A tibble: 3 × 3
# Groups:   Date [1]
Date        Hour Elevation
<date>     <dbl>     <dbl>
1 2011-01-01     0      3.54
2 2011-01-01     1      3.56
3 2011-01-01     2      3.57
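The difference from the attempt in the question is the grouping key: grouping by `hour()` alone pools the same hour of day across every date, whereas grouping by date and hour keeps each calendar hour separate. The sketch below illustrates this with a tiny two-day table; the values and the names `readings`, `by_hour_of_day`, and `by_date_and_hour` are made up for illustration and are not taken from the dataset.

```r
library(dplyr)
library(lubridate)

# Hypothetical readings: the same hour (00:xx) on two different days
readings <- tibble(
  date_time = ymd_hms(c("2011-01-01 00:00:00", "2011-01-01 00:30:00",
                        "2011-01-02 00:00:00", "2011-01-02 00:30:00")),
  Elevation = c(3.0, 3.2, 4.0, 4.2)
)

# Grouping by hour of day only: both days collapse into a single row
by_hour_of_day <- readings %>%
  group_by(Hour = hour(date_time)) %>%
  summarise(Elevation = mean(Elevation), .groups = "drop")

# Grouping by date and hour: one row per calendar hour
by_date_and_hour <- readings %>%
  group_by(Date = as_date(date_time), Hour = hour(date_time)) %>%
  summarise(Elevation = mean(Elevation), .groups = "drop")

print(by_hour_of_day)
print(by_date_and_hour)
```

With several years of data, the first form can never return more than 24 rows, which is a quick way to spot the problem.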

We can use floor_date:

library(dplyr)
library(lubridate)
library(tidyr)
read_csv %>%
group_by(grp = floor_date(date_time, "hour")) %>%
summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
separate(grp, into = c("Date", "Time"), sep = " ") 
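One detail worth checking against the desired output in the question: there the value 3.5393 sits on the 1:00:00 row, which matches the mean of the 0:00 to 0:45 readings, so each hour appears to be labelled by its end rather than its start. If that reading is right, a possible variant is to add one hour to the floored timestamp. This is a sketch under that assumption; the names `readings`, `hour_end`, and `hourly` are illustrative, and `format()` is used here instead of `separate()` to build the Date and Time columns.

```r
library(dplyr)
library(lubridate)

# First nine sample rows from the question (times zero-padded)
readings <- tibble(
  date_time = ymd_hms(paste("2011-01-01",
                            c("00:00:00", "00:15:00", "00:30:00", "00:45:00",
                              "01:00:00", "01:15:00", "01:30:00", "01:45:00",
                              "02:00:00"))),
  Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548,
                3.555, 3.561, 3.566, 3.568)
)

hourly <- readings %>%
  # Readings in [h, h + 1) are labelled with the end of the hour, h + 1
  group_by(hour_end = floor_date(date_time, "hour") + hours(1)) %>%
  summarise(Elevation = mean(Elevation, na.rm = TRUE), .groups = "drop") %>%
  # Build text columns explicitly so midnight keeps its "00:00:00" time part
  mutate(Date = format(hour_end, "%Y-%m-%d"),
         Time = format(hour_end, "%H:%M:%S")) %>%
  select(Date, Time, Elevation)

print(hourly)
```

On this sample the 01:00:00 row is 3.53925 and the 02:00:00 row is 3.5575, in line with the first two values the question asks for. The last row contains only the single 2:00 reading because the sample stops there.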
