按粒度级别对数据进行子集化,并将函数应用于 R 中的每个数据帧



好吧,这个问题相当长和复杂(至少对我来说(,我已经尽力使这个问题尽可能清晰、有条理和详细,所以请耐心等待......

----------------------------------------------------------------------

我目前在将函数应用于数据中的子集时有一个过于手动的过程,我想弄清楚如何使代码更有效率。用一个例子来描述这个问题是最容易的:

我的数据 (myData( 中的变量:2017 年、2018 年和 2019 年的 GDP,分为 4 个粒度级别:大陆、国家、州(或省(和城市。 (注:GDP数字是任意的;仅用于简化计算(

我的数据:

|------|---------------|---------|------------|-------------|------|
| Year | Continent     | Country | State      | City        | GDP  |
|------|---------------|---------|------------|-------------|------|
| 2019 | North America | Canada  | Alberta    | Edmonton    | 13   |
| 2018 | North America | Canada  | Alberta    | Calgary     | 9    |
| 2018 | North America | Canada  | Alberta    | Edmonton    | 3    | 
| 2018 | Asia          | India   | Bihar      | Patna       | 19   |
| 2018 | Asia          | India   | Bihar      | Gaya        | 8    |
| 2017 | Asia          | India   | Bihar      | Patna       | 22   | 
| 2019 | Asia          | India   | Bihar      | Gaya        | 19   |
| 2019 | Asia          | India   | Bihar      | Patna       | 16   |
| 2019 | North America | USA     | California | San Diego   | 23   |
| 2017 | North America | USA     | California | Los Angeles | 18   |
| 2018 | North America | USA     | California | Los Angeles | 25   |
| 2018 | North America | USA     | Florida    | Tampa       | 14   |
| 2019 | North America | USA     | Florida    | Miami       | 19   |
| 2018 | Asia          | China   | Guangdong  | Shenzhen    | 29   |
| 2017 | Asia          | China   | Guangdong  | Shenzhen    | 26   |
| 2019 | Asia          | China   | Guangdong  | Shenzhen    | 33   |
| 2019 | Asia          | China   | Guangdong  | Guangzhou   | 20   |
| 2018 | Asia          | China   | Guangdong  | Guangzhou   | 19   |
| 2018 | North America | Canada  | Quebec     | Montreal    | 11   |
| 2019 | North America | Canada  | Quebec     | Montreal    | 7    |
| 2019 | Asia          | China   | Shandong   | Yantai      | 30   |
| 2019 | Asia          | China   | Shandong   | Jinan       | 16   |
| 2018 | Asia          | China   | Shandong   | Yantai      | 17   |
| 2018 | Asia          | China   | Shandong   | Jinan       | 11   |
| 2019 | Asia          | India   | U.P.       | Allahabad   | 21   |
| 2018 | Asia          | India   | U.P.       | Agra        | 15   |
| 2018 | Asia          | India   | U.P.       | Allahabad   | 13   |
| 2019 | Asia          | India   | U.P.       | Agra        | 18   |
|------|---------------|---------|------------|-------------|------|

总体目标是计算 GDP 分位数 (1 = 0-25%, 2 = 25%-50%,...等(在不同的粒度级别。这正是我正在寻找的:

  • 每年的分位数;(3 年的整个数据集的子集(
  • 每个大陆的分位数;(各大洲的子集数据(
  • 每个国家的分位数;(按大洲和国家/地区划分的子集数据(
  • 每个州的分位数;(按洲、国家和州划分的子集数据(
  • 每个城市的分位数;(按洲、国家、州、省和城市划分的子集数据(


我目前在这个过程中有两个步骤:

  1. 每个级别的子集数据。
  2. 计算每个子集的分位数(基于 GDP 值(。


我们通过在每个级别上求和/添加 GDP 来细分。(注意:当我们向下移动到级别 5 时,此步骤将生成行越来越少的数据帧。这是我所做的,它是相当手动和重复的,所以我想找到一个更好的方法:

Level_1.Year <- aggregate(
GDP ~ 
Year + 
Continent + 
Country + 
State.Province + 
City, 
FUN = sum, 
data = myData)
Level_2.Continent <- aggregate(
GDP ~ 
Continent + 
Country + 
State.Province + 
City, 
FUN = sum, 
data = myData)
Level_3.Country <- aggregate(
GDP ~ 
Country + 
State.Province + 
City, 
FUN = sum, 
data = myData)
Level_4.State.Province <- aggregate(
GDP ~ 
State.Province + 
City, 
FUN = sum, 
data = myData)
Level_5.City <- aggregate(
GDP ~ 
City, 
FUN = sum, 
data = myData)
----------------------------------------------------------------------

所以现在我们有了子集,我们计算每个子集的分位数。由于它们都是不同的长度并且没有相同的变量,因此我为每个子集采用了手动/重复计算(再次...(:

Level_1.Year_quantiles <- Level_1.Year %>% 
group_by(Year) %>% 
mutate(Quantile = cut(GDP,
breaks = quantile(GDP, 
c(0, 0.25, 0.5, 0.75, 1)), 
labels = 1:4, 
include.lowest = TRUE))
Level_2.Continent_quantiles <- Level_2.Continent %>% 
group_by(Continent) %>% 
mutate(Quantile = cut(GDP,
breaks = quantile(GDP, 
c(0, 0.25, 0.5, 0.75, 1)), 
labels = 1:4, 
include.lowest = TRUE))
Level_3.Country_quantiles <- Level_3.Country %>% 
group_by(Country) %>% 
mutate(Quantile = cut(GDP,
breaks = quantile(GDP, 
c(0, 0.25, 0.5, 0.75, 1)), 
labels = 1:4, 
include.lowest = TRUE))
. 
.
.
# All the way through Level_5.City; I think you get the point. 
----------------------------------------------------------------------

有没有办法 (1( 以更有效的方式对每个级别进行子集,然后 (2( 将每个子集存储在数据帧列表中,然后 (3( 将分位数添加到列表中的每个数据帧?

如果有更好的方法来完成整个过程,请告诉我!另外,如果您有任何意见或建议,我很想听听。

考虑一个应用族解决方案,即处理列表中所有处理的lapplyby(包装器到tapply(和Map(包装器到mapply

(:
agg_factors <- c("City", "State", "Country", "Continent", "Year")
# NAMED LIST OF DATA FRAMES WHERE FORMULA DYNAMICALLY BUILT AND PASS INTO aggregate()
agg_df_list <- setNames(lapply(seq_along(agg_factors), function(i) {
agg_formula <- as.formula(paste("GDP ~", paste(agg_factors[1:i], collapse=" + ")))
aggregate(agg_formula, myData, FUN=sum)
}), agg_factors)
# FUNCTION TO CALL by() TO RUN FUNCTION ON EACH SUBSET TO BIND TOGETHER AT END
proc_quantiles <- function(df, nm) {
dfs <- by(df, df[[nm]], function(sub) 
transform(sub,
Quantile = tryCatch(cut(GDP,
breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)), 
labels = 1:4, 
include.lowest = TRUE), 
error = function(e) NA)
)
) 
do.call(rbind, unname(dfs))      
}
# ELEMENTWISE LOOP THROUGH DFs AND CORRESPONDING NAMES
quantile_df_list <- Map(proc_quantiles, agg_df_list, names(agg_df_list))

输出

head(quantile_df_list$City)
#        City GDP Quantile
# 1      Agra  33       NA
# 2 Allahabad  34       NA
# 3   Calgary   9       NA
# 4  Edmonton  16       NA
# 5      Gaya  27       NA
# 6 Guangzhou  39       NA
head(quantile_df_list$State)
#          City      State GDP Quantile
# 1     Calgary    Alberta   9        1
# 2    Edmonton    Alberta  16        4
# 3        Gaya      Bihar  27        1
# 4       Patna      Bihar  57        4
# 5 Los Angeles California  43        4
# 6   San Diego California  23        1
head(quantile_df_list$Country)
#        City     State Country GDP Quantile
# 1   Calgary   Alberta  Canada   9        1
# 2  Edmonton   Alberta  Canada  16        2
# 3  Montreal    Quebec  Canada  18        4
# 4 Guangzhou Guangdong   China  39        2
# 5  Shenzhen Guangdong   China  88        4
# 6     Jinan  Shandong   China  27        1
head(quantile_df_list$Continent)
#        City     State Country Continent GDP Quantile
# 1 Guangzhou Guangdong   China      Asia  39        3
# 2  Shenzhen Guangdong   China      Asia  88        4
# 3     Jinan  Shandong   China      Asia  27        1
# 4    Yantai  Shandong   China      Asia  47        3
# 5      Gaya     Bihar   India      Asia  27        1
# 6     Patna     Bihar   India      Asia  57        4
head(quantile_df_list$Year)
#          City      State Country     Continent Year GDP Quantile
# 1    Shenzhen  Guangdong   China          Asia 2017  26        4
# 2       Patna      Bihar   India          Asia 2017  22        2
# 3 Los Angeles California     USA North America 2017  18        1
# 4   Guangzhou  Guangdong   China          Asia 2018  19        3
# 5    Shenzhen  Guangdong   China          Asia 2018  29        4
# 6       Jinan   Shandong   China          Asia 2018  11        1

首先,澄清一下:你所说的子集是分组摘要。有关详细信息,请参阅"聚合"。其次,你的三个问题的答案都是肯定的。第三,您的 1 级摘要等效于您的数据框。

当您使用aggregate()时,我将首先说明如何使用aggregate()获得分组摘要的列表:

library(tidyverse)
formula_list <- 
list(
GDP ~ Year + Continent + Country + State.Province + City,
GDP ~ Continent + Country + State.Province + City, 
GDP ~ Country + State.Province + City, 
GDP ~ State.Province + City,
GDP ~ City
)
summaries <- formula_list %>% 
map( ~ aggregate(.x, FUN = sum, data = myData))

也可以用完全基于dplyr的方法替换aggregate()。这样做的好处是取代了臭名昭著的低效aggregate()。缺点是,我们将不得不处理商量,这是一个更高级的主题(有关更多信息,请咨询vignette("programming")(。

var_combs <- list(
vars(Year, Continent, Country, State.Province, City),
vars(Continent, Country, State.Province, City),
vars(Country, State.Province, City),
vars(State.Province, City),
vars(City)) 
summaries <- var_combs %>% 
map(~ myData %>% 
group_by(!!!.x) %>% 
summarize(GDP = sum(GDP)))

接下来是将用于计算四分位数的代码应用于列表的每个元素。由于您还改变了分组变量,我们需要迭代两个列表,因此我们将使用purrr::map2()

grp_var <- list(
vars(Year),
vars(Continent),
vars(Country),
vars(State.Province),
vars(City)
)
map2(summaries[1:3], 
grp_var[1:3], 
~ .x %>%  
group_by(!!!.y) %>% 
mutate(Quantile = cut(GDP,
breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)), 
labels = 1:4, 
include.lowest = TRUE))
)

您会注意到,我必须将列表子集化为仅前三个元素。如果其中一个组只有一个观测值,则您编写的用于计算四分位数的代码将失败(这是有道理的:您无法计算一个样本的四分位数(。五个元素中的最后一个始终如此,因为根据定义,每个组只包含一个元素。如果每组只有两个或三个观测值,则您的结果是否特别有意义也是值得怀疑的。

数据:

myData <- structure(list(
Year = c(2019, 2019, 2018, 2019, 2019, 2018, 2019, 
2018, 2018, 2018, 2018, 2018, 2018, 2017, 2017, 2019, 2018, 2019, 
2019, 2018, 2019, 2017, 2019, 2018, 2018, 2018, 2019, 2019), 
Continent = c("North America", "Asia", "Asia", "North America", 
"Asia", "North America", "Asia", "North America", "Asia", 
"North America", "Asia", "Asia", "Asia", "North America", 
"Asia", "North America", "Asia", "North America", "Asia", 
"North America", "Asia", "Asia", "Asia", "North America", 
"Asia", "Asia", "Asia", "Asia"), 
Country = c("Canada", "India", "India", "USA", "China", "USA", "China", 
"Canada", "China", "Canada", "India", "India", "China", 
"USA", "China", "USA", "India", "Canada", "China", "USA", 
"China", "India", "India", "Canada", "China", "China", 
"India", "India"), 
State.Province = c("Alberta", "Uttar Pradesh", "Bihar", "California", 
"Shandong", "Florida", "Shandong", "Quebec", "Guangdong", 
"Alberta", "Uttar Pradesh", "Bihar", "Shandong", 
"California", "Guangdong", "Florida", "Uttar Pradesh", 
"Quebec", "Guangdong", "California", "Guangdong", "Bihar", 
"Bihar", "Alberta", "Shandong", "Guangdong", "Uttar Pradesh", 
"Bihar"), 
City = c("Edmonton", "Allahabad", "Patna", "Los Angeles", "Yantai", "Miami", 
"Jinan", "Montreal", "Shenzhen", "Calgary", "Agra", "Gaya", "Yantai", 
"Los Angeles", "Shenzhen", "Miami", "Allahabad", "Montreal", 
"Shenzhen", "Los Angeles", "Guangzhou", "Patna", "Gaya", "Edmonton", 
"Jinan", "Guangzhou", "Agra", "Patna"), 
GDP = c(13, 21, 19, 23, 30, 14, 16, 11, 29, 9, 15, 8, 17, 18, 26, 19, 13, 7, 
33, 25, 20, 22, 19, 3, 11, 19, 18, 16)), 
class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), 
row.names = c(NA, -28L), 
spec = structure(list(cols = list(Year = structure(list(), class = c("collector_double", "collector")), 
Continent = structure(list(), class = c("collector_character", "collector")), 
Country = structure(list(), class = c("collector_character", "collector")), 
State.Province = structure(list(), class = c("collector_character", "collector")), 
City = structure(list(), class = c("collector_character", "collector")), 
GDP = structure(list(), class = c("collector_double", "collector"))), 
default = structure(list(), class = c("collector_guess", "collector")), 
skip = 2), 
class = "col_spec"))

最新更新