R: calculations across factor levels



I created the dummy dataset below, which resembles the actual data I'm working with.

library(tidyverse)
factor <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "A", "B", "C", "D", "E", "F", "G", "H", "I")
year <- c("2019", "2019", "2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020")
sales <- c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
df <- tibble(factor, year, sales) 
df
# A tibble: 18 x 3
   factor year  sales
   <chr>  <chr> <dbl>
 1 A      2019      5
 2 B      2019      4
 3 C      2019      2
 4 D      2019      1
 5 E      2019      6
 6 F      2019      4
 7 G      2020      7
 8 H      2020      8
 9 I      2020      5
10 A      2020      1

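I want to perform a mathematical operation on the sales column for each factor. For example, %_change(factor_A) = sales_2020/sales_2019, %_change(factor_B) = sales_2020/sales_2019, and so on. In my actual dataset, any given factor can appear any number of times in any given year.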

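Since each factor can appear multiple times in any given year, I realize I probably need to group by factor and year (even though in my example each factor is meant to appear only once per year). So: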

df %>%
  group_by(factor, year) %>%
  summarise(
    n = sum(sales) # this is to account for factors showing up multiple times within the same year
  )

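I know I could pivot wider so that each year gets its own column, do the math "horizontally", and then pivot back to long, but I'm wondering whether there is a more elegant solution, perhaps by looping over the factors.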
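For reference, the pivot-wider route described above might look roughly like the sketch below. It rebuilds the dummy data exactly as posted, so G, H and I have no 2019 rows and their ratio comes out as NA. The names `wide`, `sales_2019` and `sales_2020` are illustrative choices, not part of the original post.

```r
library(dplyr)
library(tidyr)

# Dummy data as posted in the question: the first 6 rows are 2019,
# the remaining 12 rows are 2020 (so G, H and I only appear in 2020).
df <- tibble(
  factor = rep(LETTERS[1:9], times = 2),
  year   = c(rep("2019", 6), rep("2020", 12)),
  sales  = c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
)

wide <- df %>%
  # Collapse repeated factor/year rows into a single total per pair.
  group_by(factor, year) %>%
  summarise(n = sum(sales), .groups = "drop") %>%
  # One column per year: sales_2019 and sales_2020.
  pivot_wider(names_from = year, values_from = n, names_prefix = "sales_") %>%
  # The "horizontal" calculation; NA where a year is missing for a factor.
  mutate(pct_change = sales_2020 / sales_2019)

wide
```

Pivoting back to long afterwards is only needed if the per-row layout matters; the ratio itself is already one value per factor.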

I'd appreciate any advice.

You can do it like this: group_by factor, then divide each factor's summed 2020 sales by its summed 2019 sales.

df %>%
  group_by(factor) %>%
  mutate(pct_change = sum(sales[year == "2020"]) / sum(sales[year == "2019"]))

Output
# A tibble: 18 × 4
# Groups:   factor [9]
   factor year  sales pct_change
   <chr>  <chr> <dbl>      <dbl>
 1 A      2019      5      0.2
 2 B      2019      4      1.5
 3 C      2019      2      3.5
 4 D      2019      1      2
 5 E      2019      6      0.667
 6 F      2019      4      0.25
 7 G      2019      7      0.857
 8 H      2019      8      0.875
 9 I      2019      5      1
10 A      2020      1      0.2
11 B      2020      6      1.5
12 C      2020      7      3.5
13 D      2020      2      2
14 E      2020      4      0.667
15 F      2020      1      0.25
16 G      2020      6      0.857
17 H      2020      7      0.875
18 I      2020      5      1
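If one row per factor is more convenient than repeating the ratio on every row, the same expression can go into summarise() instead of mutate(). This is a minimal sketch, assuming the adjusted data in which every factor has both a 2019 and a 2020 row; `per_factor`, `sales_2019` and `sales_2020` are illustrative names.

```r
library(dplyr)

# Adjusted dummy data: every factor A..I has one 2019 row and one 2020 row.
df <- tibble(
  factor = rep(LETTERS[1:9], times = 2),
  year   = rep(c("2019", "2020"), each = 9),
  sales  = c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
)

per_factor <- df %>%
  group_by(factor) %>%
  summarise(
    # sum() covers factors that appear several times within one year.
    sales_2019 = sum(sales[year == "2019"]),
    sales_2020 = sum(sales[year == "2020"]),
    pct_change = sales_2020 / sales_2019
  )

per_factor
```

Keeping the two yearly totals next to the ratio makes it easy to check a value by hand, for example 1 / 5 = 0.2 for factor A.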

I changed your data slightly, because some factors did not have both time points:

library(tidyverse)
factor <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "A", "B", "C", "D", "E", "F", "G", "H", "I")
year <- c("2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020")
sales <- c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
df <- tibble(factor, year, sales) 
df
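To illustrate why the adjustment matters: in the data as originally posted, G, H and I have no 2019 rows, so `sum(sales[year == "2019"])` is the sum of an empty vector, which is 0, and the division returns Inf. The sketch below shows that and one possible guard that returns NA instead. The guard and the names `original`, `check`, `has_2019` and `raw_ratio` are assumptions added for illustration, not part of the answer above.

```r
library(dplyr)

# Data as posted in the question: only the first 6 rows are 2019.
original <- tibble(
  factor = rep(LETTERS[1:9], times = 2),
  year   = c(rep("2019", 6), rep("2020", 12)),
  sales  = c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
)

check <- original %>%
  group_by(factor) %>%
  summarise(
    has_2019   = any(year == "2019"),
    total_2019 = sum(sales[year == "2019"]),  # 0 when there are no 2019 rows
    total_2020 = sum(sales[year == "2020"]),
    raw_ratio  = total_2020 / total_2019,     # Inf for G, H and I
    # Report NA rather than Inf when the baseline year is missing.
    pct_change = if_else(has_2019, raw_ratio, NA_real_)
  )

check
```

Testing for the presence of 2019 rows, rather than for a zero total, keeps a genuine zero-sales year distinguishable from a missing one.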
