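I'm new to both R and programming.
I have a dataset of over 2 million rows, which I've split into 3 separate CSV files.
The current CSV looks like this (I've removed some rows for simplicity):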
Date Time Elevation
1 2011-01-01 0:00:00 3.532
2 2011-01-01 0:15:00 3.538
3 2011-01-01 0:30:00 3.541
4 2011-01-01 0:45:00 3.546
5 2011-01-01 1:00:00 3.548
6 2011-01-01 1:15:00 3.555
7 2011-01-01 1:30:00 3.561
8 2011-01-01 1:45:00 3.566
9 2011-01-01 2:00:00 3.568
......
Date Time Elevation
484985 2018-04-21 0:00:00 3.15264
484986 2018-04-21 0:05:00 3.15164
484987 2018-04-21 0:10:00 3.14964
484988 2018-04-21 0:15:00 3.14563
484989 2018-04-21 0:20:00 3.14263
484990 2018-04-21 0:25:00 3.13962
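The time intervals are inconsistent: some of the data were measured at 15-minute intervals and some at 5-minute intervals. I'd like output like this: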
Date Time Elevation
1 2011-01-01 0:00:00
2 2011-01-01 1:00:00 3.5393
3 2011-01-01 2:00:00 3.5575
4 2011-01-01 3:00:00 3.5788
I tried this code from an older post, but it doesn't work: it only averaged over the whole day rather than per hour:
library(lubridate)
df[,2] <- ymd_hms(df[,2])
df %>% mutate(hour = hour(df[,2])) %>%
group_by(hour) %>% summarise_at(vars(Elevation), mean)
I now have this, which splits the time into separate hour, minute, and second columns:
library(readr)
library(tidyverse)
read_csv <- read.csv('dataset1.csv') %>%
  mutate(date_time = ymd_hms(paste0(Date, " ", Time)))
df1 <- read_csv %>%
  separate(Time, into = c("Hour", "Min", "sec"), sep = ":", remove = FALSE)
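I've spent a long time trying to figure this out and have gone through similar posts here, but nothing has worked so far. Any help would be much appreciated!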
If you are happy with a simple average of the elevations, cut the date/time column into one-hour bins and let group_by handle the rest:
df <- structure(list(Date = c("2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2018-04-21", "2018-04-21", "2018-04-21", "2018-04-21",
"2018-04-21", "2018-04-21"), Time = c("0:00:00", "0:15:00", "0:30:00",
"0:45:00", "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00",
"0:00:00", "0:05:00", "0:10:00", "0:15:00", "0:20:00", "0:25:00"
), Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561,
3.566, 3.568, 3.15264, 3.15164, 3.14964, 3.14563, 3.14263, 3.13962
)), class = "data.frame", row.names = c(NA, -15L))
# create the datetime column (tz = "UTC" is an assumption; use the logger's time zone)
df$datetime <- as.POSIXct(paste(df$Date, df$Time), tz = "UTC")
# bin into 1-hour intervals, then group and take the mean
library(dplyr)
df %>% group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))
With 2 million rows this will take some time, but it should not take too long.
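Since the question mentions that the data are split across several CSV files, here is a minimal sketch of stacking them before the hourly aggregation. The file names (part1.csv, part2.csv) and the temporary directory are hypothetical stand-ins so the example runs on its own, and tz = "UTC" is an assumption about the timestamps.

```r
library(dplyr)

# Hypothetical stand-ins for the real CSV files: write two tiny files to a
# temporary directory so this sketch is self-contained.
tmp <- tempfile("elev_")
dir.create(tmp)
write.csv(data.frame(Date = "2011-01-01",
                     Time = c("0:00:00", "0:15:00", "1:00:00"),
                     Elevation = c(3.532, 3.538, 3.548)),
          file.path(tmp, "part1.csv"), row.names = FALSE)
write.csv(data.frame(Date = "2018-04-21",
                     Time = c("0:00:00", "0:05:00"),
                     Elevation = c(3.15264, 3.15164)),
          file.path(tmp, "part2.csv"), row.names = FALSE)

# Read every CSV in the directory and stack the pieces into one data frame.
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
all_df <- bind_rows(lapply(files, read.csv, stringsAsFactors = FALSE))

# Build the timestamp (UTC is an assumption) and average within each hour.
all_df$datetime <- as.POSIXct(paste(all_df$Date, all_df$Time), tz = "UTC")
hourly <- all_df %>%
  group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))
```

Because the bins are built from the full timestamp, the 5-minute and 15-minute stretches of the record are handled the same way: each hour simply averages however many readings fall inside it.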
library(tibble)
library(dplyr)
library(lubridate)
tibble::tribble(
~Date, ~Time, ~Elevation,
"2011-01-01", "0:00:00", 3.532,
"2011-01-01", "0:15:00", 3.538,
"2011-01-01", "0:30:00", 3.541,
"2011-01-01", "0:45:00", 3.546,
"2011-01-01", "1:00:00", 3.548,
"2011-01-01", "1:15:00", 3.555,
"2011-01-01", "1:30:00", 3.561,
"2011-01-01", "1:45:00", 3.566,
"2011-01-01", "2:00:00", 3.568
) %>%
dplyr::mutate(
Date = lubridate::date(Date),
Time = lubridate::hms(Time),
Hour = lubridate::hour(Time)) %>%
dplyr::group_by(Date, Hour) %>%
dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: Date [1]
Date Hour Elevation
<date> <dbl> <dbl>
1 2011-01-01 0 3.54
2 2011-01-01 1 3.56
3 2011-01-01 2 3.57
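For contrast with the attempt in the question, here is a small sketch of why grouping by the hour of day alone is not enough once the data span more than one day. It reuses a few rows from the question; extracting the hour with sub() is just one convenient way to do it here, not the method used above.

```r
library(dplyr)

# A few rows from two different days, both within hour 0.
df <- data.frame(
  Date = c(rep("2011-01-01", 4), rep("2018-04-21", 3)),
  Time = c("0:00:00", "0:15:00", "0:30:00", "0:45:00",
           "0:00:00", "0:05:00", "0:10:00"),
  Elevation = c(3.532, 3.538, 3.541, 3.546, 3.15264, 3.15164, 3.14964)
)

# Hour of day: the text before the first colon, as an integer.
df$Hour <- as.integer(sub(":.*", "", df$Time))

# Grouping by Hour alone pools hour 0 of every day into a single row.
by_hour <- df %>%
  group_by(Hour) %>%
  summarise(Elevation = mean(Elevation))

# Grouping by Date and Hour keeps one row per hour of each day.
by_date_hour <- df %>%
  group_by(Date, Hour) %>%
  summarise(Elevation = mean(Elevation), .groups = "drop")
```

With these rows, by_hour has a single row mixing 2011 and 2018 readings, while by_date_hour has one row per day.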
We can use floor_date from lubridate to round each timestamp down to the hour:
library(dplyr)
library(lubridate)
library(tidyr)
read_csv %>%
group_by(grp = floor_date(date_time, "hour")) %>%
summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
separate(grp, into = c("Date", "Time"), sep = " ")
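The desired output in the question appears to label each mean with the end of its hour: 3.5393 at 1:00:00 matches the mean of the four readings from 0:00 to 0:45, and 3.5575 at 2:00:00 matches the readings from 1:00 to 1:45. A minimal sketch of that labelling, assuming UTC timestamps, shifts the floored hour forward by one hour. It also builds the Date and Time columns with format() instead of separate(), which avoids depending on how a timestamp is converted to text.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  Date = "2011-01-01",
  Time = c("0:00:00", "0:15:00", "0:30:00", "0:45:00",
           "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00"),
  Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561, 3.566, 3.568)
)

hourly <- df %>%
  # tz = "UTC" is an assumption; substitute the logger's time zone.
  mutate(date_time = as.POSIXct(paste(Date, Time),
                                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")) %>%
  # Floor to the start of the hour, then shift so the label is the hour's end.
  group_by(grp = floor_date(date_time, "hour") + hours(1)) %>%
  summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
  # Split the label into explicit Date and Time text columns.
  mutate(Date = format(grp, "%Y-%m-%d"),
         Time = format(grp, "%H:%M:%S")) %>%
  select(Date, Time, Elevation)
```

In this sample the 03:00:00 row is based on the single 2:00 reading, so it will not match the 3.5788 shown in the question, which was computed from the full data.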