I created the dummy dataset below, which looks similar to the actual data I'm working with.
library(tidyverse)
factor <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "A", "B", "C", "D", "E", "F", "G", "H", "I")
year <- c("2019", "2019", "2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020")
sales <- c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
df <- tibble(factor, year, sales)
df
# A tibble: 18 x 3
factor year sales
<chr> <chr> <dbl>
1 A 2019 5
2 B 2019 4
3 C 2019 2
4 D 2019 1
5 E 2019 6
6 F 2019 4
7 G 2020 7
8 H 2020 8
9 I 2020 5
10 A 2020 1
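I'd like to perform a calculation on the sales column for each factor. For example, %_change(factor_A) = sales_2020/sales_2019, %_change(factor_B) = sales_2020/sales_2019, and so on. In my actual dataset, any given factor can appear any number of times in any given year.
Since each factor can appear multiple times within a year, I realize I probably need to group by factor and year (even though in my example each factor shows up at most once per year). So: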
df %>%
  group_by(factor, year) %>%
  summarise(
    n = sum(sales) # this is to account for factors showing up multiple times within the same year
  )
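I know I could pivot wider so that each year gets its own column, do the math "horizontally", and then pivot back to long, but I'm wondering whether there is a more elegant solution, perhaps by looping over the factors.
I'd appreciate any suggestions.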
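For comparison, the pivot-wider route described in the question could be sketched roughly as follows. This is only an illustration on the question's dummy data; the `sales_` column prefix and the `wide` name are choices made here, not something from the original post. Factors with no 2019 rows (G, H and I in this data) end up with NA.

```r
library(tidyverse)

# Same dummy data as in the question, built more compactly
df <- tibble(
  factor = rep(LETTERS[1:9], times = 2),
  year   = rep(c("2019", "2020"), times = c(6, 12)),
  sales  = c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
)

wide <- df %>%
  # total sales per factor and year (handles repeated factor/year rows)
  group_by(factor, year) %>%
  summarise(n = sum(sales), .groups = "drop") %>%
  # one column per year: sales_2019, sales_2020 (missing combinations become NA)
  pivot_wider(names_from = year, values_from = n, names_prefix = "sales_") %>%
  # ratio of 2020 to 2019, as defined in the question
  mutate(pct_change = sales_2020 / sales_2019)

wide
```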
You can do it like this: group_by factor, then, within each factor, divide the (summed) 2020 sales by the (summed) 2019 sales.
df %>%
  group_by(factor) %>%
  mutate(pct_change = sum(sales[year == "2020"]) / sum(sales[year == "2019"]))
Output:
# A tibble: 18 × 4
# Groups: factor [9]
factor year sales pct_change
<chr> <chr> <dbl> <dbl>
1 A 2019 5 0.2
2 B 2019 4 1.5
3 C 2019 2 3.5
4 D 2019 1 2
5 E 2019 6 0.667
6 F 2019 4 0.25
7 G 2019 7 0.857
8 H 2019 8 0.875
9 I 2019 5 1
10 A 2020 1 0.2
11 B 2020 6 1.5
12 C 2020 7 3.5
13 D 2020 2 2
14 E 2020 4 0.667
15 F 2020 1 0.25
16 G 2020 6 0.857
17 H 2020 7 0.875
18 I 2020 5 1
I changed your data slightly, because some factors did not have both time points:
library(tidyverse)
factor <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "A", "B", "C", "D", "E", "F", "G", "H", "I")
year <- c("2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020")
sales <- c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
df <- tibble(factor, year, sales)
df
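If the data cannot be adjusted this way, the same grouped idea still runs on the question's original data, but `sum()` over an empty selection returns 0, so a factor with no 2019 rows would get a ratio of `Inf`. A minimal sketch of one way to guard against that, assuming NA is the desired result for such factors and that one row per factor is wanted (`df_orig` and `ratio_by_factor` are names made up for this example):

```r
library(tidyverse)

# The question's original data: G, H and I have no 2019 rows
df_orig <- tibble(
  factor = rep(LETTERS[1:9], times = 2),
  year   = rep(c("2019", "2020"), times = c(6, 12)),
  sales  = c(5, 4, 2, 1, 6, 4, 7, 8, 5, 1, 6, 7, 2, 4, 1, 6, 7, 5)
)

ratio_by_factor <- df_orig %>%
  group_by(factor) %>%
  # summarise() instead of mutate(): one row per factor
  summarise(
    sales_2019 = sum(sales[year == "2019"]),
    sales_2020 = sum(sales[year == "2020"]),
    .groups = "drop"
  ) %>%
  # assumption: a 2019 total of 0 means "no usable baseline", so return NA
  mutate(pct_change = if_else(sales_2019 == 0, NA_real_, sales_2020 / sales_2019))

ratio_by_factor
```

Note that this check also turns a genuine 2019 total of zero into NA; if that distinction matters, counting the 2019 rows with `sum(year == "2019")` would be a more precise test.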