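I have read several threads explaining that for loops should be discouraged, so if there is a better way I would like to learn it. I will say that I have already tried using summarize() together with group_by().
What I am trying to build is a climate database. I have successfully programmed R to download the data directly from the source and to convert the resulting list into a data.frame. Now I want to sum and/or average several columns by month and year, which is why I tried summarize and group_by. My problem is that the data comes with the codes "M" or "T", and I want to keep track of those codes, so I arbitrarily assigned them the integers M = 9999 and T = 9998. My idea was that, when handling the codes, I could use a for loop to evaluate the data row by row, convert these two placeholders to "0", and return the number of "M" and "T" entries in that subset.
Here is how the data arrives: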
$data
# A tibble: 935 x 8
date datatype station value fl_m fl_q fl_so fl_t
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 2020-01-01T00:0~ PRCP GHCND:USW0002~ 76 "" "" W "240~
2 2020-01-01T00:0~ SNOW GHCND:USW0002~ 0 "T" "" W ""
3 2020-01-01T00:0~ SNWD GHCND:USW0002~ 0 "T" "" W ""
4 2020-01-01T00:0~ TMAX GHCND:USW0002~ 39 "" "" W "240~
5 2020-01-01T00:0~ TMIN GHCND:USW0002~ -5 "" "" W "240~
6 2020-01-02T00:0~ PRCP GHCND:USW0002~ 3 "" "" W "240~
7 2020-01-02T00:0~ SNOW GHCND:USW0002~ 5 "" "" W ""
8 2020-01-02T00:0~ SNWD GHCND:USW0002~ 0 "" "" W ""
9 2020-01-02T00:0~ TMAX GHCND:USW0002~ 11 "" "" W "240~
10 2020-01-02T00:0~ TMIN GHCND:USW0002~ -10 "" "" W "240~
# ... with 925 more rows
Here is the code I use to convert it from a list into a data.frame:
## Convert a list from NCDC into a data frame
## mso_data is a placeholder file for the downloaded data from NCDC
## mso_light2 is a placeholder for the destination data frame
## NCDC downloads in a list, the data is stored in the $data portion
library(tidyverse)
## first convert from list to data.frame and remove 'station ID' column
mso_light2 <- mso_data$data[, -3]
## remove time from date group
mso_date <- mso_light2[1]
mso_date <- sub("T.*", "", mso_date$date)
mso_light2$date <- mso_date
## remove flags for fl_so? and fl_t (time)
mso_light2 <- mso_light2[1:5]
## Change 'T' = 9998 & 'M' = 9999
mso_light2$value[mso_light2$fl_m == "T"] <- 9998
mso_light2$value[mso_light2$fl_q == "M"] <- 9999
## pivot data frame
## eventually use to change column names
## v_names <- c('PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN')
mso_light2 <- mso_light2[1:3]
mso_light2 <- pivot_wider(mso_light2,
names_from = datatype,
values_from = value)
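This is what the data frame looks like after the conversion. I added columns for the month, the year, and the average daily temperature, "TAVG":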
# A tibble: 187 x 9
# Rowwise:
date PRCP SNOW SNWD TMAX TMIN TAVG month year
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020-01-01 76 9998 9998 39 -5 17 1 2020
2 2020-01-02 3 5 0 11 -10 0.5 1 2020
3 2020-01-03 5 8 9998 61 -38 11.5 1 2020
4 2020-01-04 8 9998 0 33 -66 -16.5 1 2020
5 2020-01-05 5 10 0 33 -21 6 1 2020
6 2020-01-06 9998 9998 9998 33 -38 -2.5 1 2020
7 2020-01-07 9998 0 0 78 -10 34 1 2020
8 2020-01-08 5 9998 9998 44 -27 8.5 1 2020
9 2020-01-09 9998 9998 0 0 -55 -27.5 1 2020
10 2020-01-10 8 10 0 -10 -99 -54.5 1 2020
# ... with 177 more rows
Here is my original code, where I tried to use summarise and group_by:
## first format mso_light2$date from <chr> to an actual 'date'
install.packages("chron")
install.packages("openair")
install.packages("lubridate")
library("openair")
library("chron")
library('lubridate')
options(stringsAsFactors = FALSE)
mso_light2$date <- as.Date(mso_light2$date, "%Y-%m-%d")
## Turning all daily temperatures into an average
mso_light2 <- mso_light2 %>% rowwise() %>%
mutate(TAVG = mean(c(TMAX, TMIN), na.rm = T))
## Composing daily data into monthly packages
mso_light2 <- mso_light2 %>%
mutate(month = month(date)) %>%
mutate(year = year(date))
## mso_PRCP <- mso_light2 %>%
## group_by(month, year) %>%
## summarise(PRCP = sum(PRCP))
## mso_SNOW <- mso_light2 %>%
## group_by(month, year) %>%
## summarise(SNOW = sum(SNOW))
## mso_TAVG <- mso_light2 %>%
## group_by(month, year) %>%
## summarise(TAVG = mean(TAVG))
## summarise(SNOW = sum(SNOW)) %>%
## summarise(TAVG = mean(TAVG))
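The problem is that I don't know how to remove my placeholders '9999' and '9998' and turn them into '0'. So I have been trying to develop a for loop, and this is what I have so far: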
for(i in 1:length(mso_light2$year[[1]])){
startDate <- as.character(mso_light2$date[1])
startDate <- str_split(startDate, "-")
start_year <- startDate[[1]][1]
start_month <- startDate[[1]][2]
start_day <- startDate[[1]][3]
for(j in 1:length(mso_light2$month)){
mso_monthly <- sapply(mso_light2,
function(x) sum(x[["PRCP"]]),
use.names =
paste(start_year, '-',
start_month, sep = ""))
}
}
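Please disregard the sapply(). I have tried every function in that family, and they all return error messages.
This is the error I keep getting:
Error in FUN(X[[i]], ...) : unused argument (use.names = "2020-01")
sapply was just the last function I tried before asking for help. Thank you.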
As I understand it, you are trying to download the 2020 data for station USW00024153 from GHCN.
library(tidyverse)
dt_path <- "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2020.csv.gz"
download.file(dt_path, "2020.csv.gz", mode="wb")
#> ID = 11 character station identification code
#> YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
#> ELEMENT = 4 character indicator of element type
#> DATA VALUE = 5 character data value for ELEMENT
#> M-FLAG = 1 character Measurement Flag
#> Q-FLAG = 1 character Quality Flag
#> S-FLAG = 1 character Source Flag
#> OBS-TIME = 4-character time of observation in hour-minute format (i.e. 0700 =7:00 am)
#> this list ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
#> data dictionary https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt
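The data from this FTP server is a bit cleaner; at least the dates do not contain timestamps. I reuse your column names because the data has no header. Also note that readr::read_csv() (and data.table::fread()) handle compressed files just fine, so there is no need to decompress first.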
dt_colnms <- c("station", "date", "datatype", "value", "fl_m", "fl_q", "fl_so", "fl_t")
dt <- readr::read_csv("2020.csv.gz", col_names = dt_colnms, col_types = 'cccdcccc')
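To see the compressed-file handling in isolation, here is a minimal sketch that does not need the full download. The two sample rows and the temporary file are made up for illustration; only the column layout follows the by_year format described above.

```r
library(readr)

# Hypothetical two-row sample in the headerless by_year layout (not real observations)
sample_lines <- c(
  "USW00024153,20200101,TMAX,39,,,W,2400",
  "USW00024153,20200101,SNOW,0,T,,W,"
)

# Write the sample to a temporary gzip-compressed file
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(sample_lines, con)
close(con)

dt_colnms <- c("station", "date", "datatype", "value", "fl_m", "fl_q", "fl_so", "fl_t")

# read_csv() decompresses the .gz file on the fly; empty fields are read as NA
sample_dt <- read_csv(gz_path, col_names = dt_colnms, col_types = "cccdcccc")
sample_dt
```

Because empty flag fields come back as NA rather than "", any later comparison such as fl_m == "T" should be guarded with !is.na(fl_m).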
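The data processing steps are:
- Filter for the station you need, and ignore the wind columns that are also present in the dataset
- Pivot several value columns (the values of interest and the flags)
- Average the temperatures. Since you only have two columns, I see no reason to use rowwise()
- Extract the month and year from the character date, and convert the date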
dt %>%
filter(station=="USW00024153", !str_detect(datatype, "^W")) %>%
pivot_wider(id_cols = "date",
names_from = "datatype",
values_from = c("fl_m", "fl_q","value")) %>%
  mutate(value_TAVG=(value_TMIN+value_TMAX)/2,
month=parse_number(substr(date, 5,6)),
year=parse_number(substr(date, 1,4)),
date=as.Date(date, "%Y%m%d"))
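When pivot_wider() receives several columns in values_from, it names the new columns by joining the value column and the datatype with an underscore. The toy tibble below is made up for illustration (one day, two datatypes) and is only a sketch of that naming behaviour, which is why the daily mean is computed from value_TMIN and value_TMAX.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format sample: one day, two datatypes
toy <- tibble(
  date     = c("20200101", "20200101"),
  datatype = c("TMAX", "TMIN"),
  value    = c(39, -5),
  fl_m     = c(NA, "T")
)

# Pivot both the flag and the value column at once
wide <- toy %>%
  pivot_wider(id_cols = "date",
              names_from = "datatype",
              values_from = c("fl_m", "value"))

# Columns: date, fl_m_TMAX, fl_m_TMIN, value_TMAX, value_TMIN
names(wide)

# Plain vectorised arithmetic is enough for a two-column mean; no rowwise() needed
wide <- wide %>%
  mutate(value_TAVG = (value_TMIN + value_TMAX) / 2)
wide$value_TAVG
```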
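Now, your last step was to replace the value with 0 in the rows where fl_m == "T" or where fl_q == "M".
You could have done that before pivoting. Then both the pivoting and the summarizing become easier: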
dt %>%
filter(station=="USW00024153", !str_detect(datatype, "^W")) %>%
mutate(value=ifelse(fl_m=="T"&!is.na(fl_m), 0, value),
value=ifelse(fl_q=="M"&!is.na(fl_q), 0, value)) %>%
pivot_wider(id_cols = "date",
names_from = "datatype",
values_from = "value") %>%
mutate(TAVG=(TMIN+TMAX)/2,
month=parse_number(substr(date, 5,6)),
year=parse_number(substr(date, 1,4)),
date=as.Date(date, "%Y%m%d")) %>%
group_by(month, year) %>%
summarize(AVG_TAVG=mean(TAVG, na.rm = TRUE),
AVG_PRCP=mean(PRCP, na.rm=TRUE),
AVG_SNOW=mean(SNOW, na.rm=TRUE)) %>%
ungroup()
#> # A tibble: 7 x 5
#> month year AVG_TAVG AVG_PRCP AVG_SNOW
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2020 -1.82 6.61 2.58
#> 2 2 2020 -7.60 9.31 11.3
#> 3 3 2020 31.6 1.77 0.0968
#> 4 4 2020 69.9 15.1 3.97
#> 5 5 2020 119. 21.3 0
#> 6 6 2020 155. 21.5 0
#> 7 7 2020 191. 2.55 0
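The question also asked for the number of "T" and "M" codes in each month. One possible sketch, assuming the flags are still available in long format, is to turn them into logical columns before summarising; the toy data and the names is_T, is_M, n_T and n_M are hypothetical, and the sum of PRCP is used here only as an example of a monthly total.

```r
library(dplyr)

# Hypothetical daily PRCP sample with measurement and quality flags
toy <- tibble(
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01")),
  fl_m  = c("T", NA, "T", NA),
  fl_q  = c(NA, NA, NA, "M"),
  value = c(0, 5, 0, 12)
)

flag_counts <- toy %>%
  mutate(month = as.integer(format(date, "%m")),
         year  = as.integer(format(date, "%Y")),
         is_T  = !is.na(fl_m) & fl_m == "T",          # NA-safe flag tests
         is_M  = !is.na(fl_q) & fl_q == "M",
         value = if_else(is_T | is_M, 0, value)) %>%  # flagged values count as 0
  group_by(year, month) %>%
  summarise(PRCP = sum(value, na.rm = TRUE),
            n_T  = sum(is_T),                         # number of "T" codes in the month
            n_M  = sum(is_M),                         # number of "M" codes in the month
            .groups = "drop")

flag_counts
```

This keeps the information that the 9998/9999 placeholders were meant to carry, without mixing sentinel numbers into the measurements.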