How to calculate a 2-year moving average in R

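I have a large data frame (900k rows) about mergers and acquisitions (M&As).

The df has four columns: date (when the M&A was completed), target_nation (the nation of the company that was merged/acquired), acquiror_nation (the nation of the acquiring company), and big_corp_TF (whether the acquiror was a big corporation, where TRUE means the company is big).

Here is a sample of my df: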

> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
> df 
date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France        TRUE
2: 2000        Uganda         Germany       FALSE
3: 2001        Uganda          France        TRUE
4: 2001        Uganda          France       FALSE
5: 2001        Uganda         Germany       FALSE
6: 2002        Uganda          France        TRUE
7: 2002        Uganda          France        TRUE
8: 2002        Uganda         Germany        TRUE

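From this data, I want to create a new variable that gives the share of M&As conducted by big corporations of a specific acquiror nation, calculated as a 2-year average. (For my actual exercise I will calculate 5-year averages, but let's keep things simple here.) So there would be one new variable for big corporations from France and another for big corporations from Germany.

What I have managed to do so far is 1) count the total number of M&As for a specific target_nation in a given year; and 2) count the total number of M&As conducted by big corporations of a given acquiror_nation in a given target_nation in a given year. I joined these two dfs to make it easier to calculate the average I want. Here is the code I used and the resulting new df: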

##counting total rows for target nations
df2 <- df %>%
group_by(date, target_nation) %>%
count(target_nation)
##counting total rows conducted by small or big corps for certain acquiror nations
df3 <- df %>%
group_by(date, target_nation, acquiror_nation) %>%
count(big_corp_TF)
##selecting rows that were conducted by big corps
df33 <- df3 %>%
filter(big_corp_TF == TRUE)
##merging df2 and df33
df4 <- df2 %>%
left_join(df33, by = c("date" = "date", "target_nation" = "target_nation"))
df4 <- as.data.frame(df4)
> df4
date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000        Uganda   2          France        TRUE   1
2 2001        Uganda   3          France        TRUE   1
3 2002        Uganda   3          France        TRUE   2
4 2002        Uganda   3         Germany        TRUE   1

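n.x is the total number of M&As (rows) for a specific target_nation in a given year; n.y is the total number of M&As (rows) conducted by big corporations of a specific acquiror_nation in that target_nation in that year.

With this new data frame df4, I can now easily calculate the share of M&As conducted by big corporations of a specific acquiror_nation in a specific target_nation in a given year. For example, let's calculate this share for France: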

df5 <- df4 %>% 
filter(acquiror_nation == "France") %>%
mutate(France_bigcorp_share_1year = n.y / n.x)
date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.3333333
3 2002        Uganda   3          France        TRUE   2                  0.6666667

However, I cannot figure out how to calculate this share as a 2-year average.

This is what the desired variable would look like:

date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.4000000
3 2002        Uganda   3          France        TRUE   2                  0.5000000

Note that the share for 2000 stays the same, since there is no previous year to make it a 2-year average; 2001 becomes 0.4 (since (1+1)/(2+3) = 0.4); and 2002 becomes 0.5 (since (1+2)/(3+3) = 0.5).
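The arithmetic in the note above can be written out as a quick check. This is only a sketch of the definition, not a proposed solution: the vectors are copied by hand from the n.x and n.y columns of df5, and the helper name roll2 is made up for illustration.

```r
# Yearly counts for France, copied from df5 (years 2000, 2001, 2002)
n_x <- c(2, 3, 3)  # all M&As in the target nation
n_y <- c(1, 1, 2)  # M&As by big French corporations

# Hypothetical helper: current value plus the previous one (0 for the first year)
roll2 <- function(x) x + c(0, head(x, -1))

# 2-year share = sum of n.y over two years / sum of n.x over two years
share_2years <- roll2(n_y) / roll2(n_x)
share_2years
```

The first year has no predecessor, so its share equals the 1-year share.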

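Do you have any ideas on how to code the calculation of the 2-year average share? I think I need to use a for loop here, but I don't know how. Any suggestions would be much appreciated.

EDIT: AnilGoyal's code works perfectly with the sample data, but my actual data is evidently messier, so I am wondering whether there is a solution to the problem I have run into.

My actual dataset sometimes skips a year, or sometimes does not include an acquiror_nation that appears in earlier rows. Please see this more accurate sample of my actual data: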

> df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany", "Germany", "Germany", "France", "France", "Germany", 
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA, 
-15L))
> df_new 
date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France     TRUE
2: 2000        Uganda         Germany    FALSE
3: 2001        Uganda          France     TRUE
4: 2001        Uganda          France    FALSE
5: 2001        Uganda         Germany    FALSE
6: 2002        Uganda          France     TRUE
7: 2002        Uganda          France     TRUE
8: 2002        Uganda         Germany     TRUE
9: 2003        Uganda         Germany     TRUE
10: 2003        Uganda         Germany    FALSE
11: 2004        Uganda          France     TRUE
12: 2004        Uganda          France    FALSE
13: 2004        Uganda         Germany     TRUE
14: 2006        Uganda          France     TRUE
15: 2006        Uganda          France     TRUE

Note: there are no rows for France in 2003, and there is no year 2005 at all.

If I run Anil's first code, the result is the following tibble:

date target_nation acquiror_nation    n1    n2 share
<int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6

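Note: there are no results for France for 2003 and 2005; I would like to have results for those years (since we are calculating 2-year averages, we should be able to get them). Also, the share for 2006 is not actually correct: it should be 1, because the average should use the values for 2005 (i.e. 0) rather than those for 2004.

I would like to obtain the following tibble: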

date target_nation acquiror_nation    n1    n2 share
<int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2003 Uganda        France              2     0   0.4
5  2004 Uganda        France              3     1   0.2
6  2005 Uganda        France              0     0   0.33
7  2006 Uganda        France              2     2   1.0

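Note: the result for 2006 is also different (since we now take 2005 rather than 2004 for the 2-year average).

Do you think it is possible to find a way to output the desired tibble? I understand this is a problem with the original data: it is simply missing certain data points. However, adding them to the original dataset seems very inconvenient; it would be better to add them midway, e.g. after counting n1 and n2. But what is the most convenient way of doing so?

EDIT2: Anil's new code works for the data sample above, but runs into an undesired problem when handling a more complex data sample (one that includes more than one target_nation). Here is a shorter but more complex data sample: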

> df_new_complex <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L, 
2003L, 1999L, 2001L, 2002L, 2002L), target_nation = c("Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Mozambique", 
"Mozambique", "Mozambique", "Mozambique"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "Germany", "Germany", 
"Germany", "France", "France", "Germany"), big_corp_TF = c(TRUE, 
FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE
)), row.names = c(NA, -11L))
> df_new_complex 
date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France        TRUE
2: 2000        Uganda         Germany       FALSE
3: 2001        Uganda          France        TRUE
4: 2001        Uganda          France       FALSE
5: 2001        Uganda         Germany       FALSE
6: 2003        Uganda         Germany        TRUE
7: 2003        Uganda         Germany       FALSE
8: 1999    Mozambique         Germany       FALSE
9: 2001    Mozambique          France        TRUE
10: 2002    Mozambique          France       FALSE
11: 2002    Mozambique         Germany        TRUE

As you can see, this data sample includes two target_nations. Anil's code, with param <- c("France", "Germany"), produces the following tibble:

date target_nation acquiror_nation    n1    n2 share
<dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  1999 Mozambique    France              1     0 0    
2  1999 Mozambique    Germany             1     0 0    
3  1999 Uganda        France              0     0 0    
4  1999 Uganda        Germany             0     0 0    
5  2000 Mozambique    France              0     0 0    
6  2000 Mozambique    Germany             0     0 0    
7  2000 Uganda        France              2     1 0.25 
8  2000 Uganda        Germany             2     0 0.167
9  2001 Mozambique    France              1     1 0.4  
10  2001 Mozambique    Germany             1     0 0.333
11  2001 Uganda        France              3     1 0.333
12  2001 Uganda        Germany             3     0 0.25 
13  2002 Mozambique    France              2     0 0.2  
14  2002 Mozambique    Germany             2     1 0.25 
15  2002 Uganda        France              0     0 0.25 
16  2002 Uganda        Germany             0     0 0.25 
17  2003 Mozambique    France              0     0 0.25 
18  2003 Mozambique    Germany             0     0 0.25 
19  2003 Uganda        France              2     0 0.167
20  2003 Uganda        Germany             2     1 0.25

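What is undesired here is that the code creates a year 1999 for Uganda and a year 2003 for Mozambique (the latter is less of a problem). In 1999, Uganda had no investments, as the data sample shows, so it makes no sense to have values for that year (it could be NA, or not be there at all). Mozambique had no investments in 2003 either, so I do not want a share calculated for Mozambique for that year.

I found a workaround for this, in which I filter for a specific target nation early in the code, like so: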

correct1 <- df_new_complex %>% 
filter(target_nation == "Mozambique") %>%
mutate(d = 1) %>% ...
#I do the same for another target_nation
correct2 <- df_new_complex %>% 
filter(target_nation == "Uganda") %>%
mutate(d = 1) %>% ...
#I then use rbind
correct <- rbind(correct1, correct2)
#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).
> correct 
date target_nation acquiror_nation    n1    n2 share
<dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  1999 Mozambique    France              1     0 0    
2  1999 Mozambique    Germany             1     0 0    
3  2000 Mozambique    France              0     0 0    
4  2000 Mozambique    Germany             0     0 0    
5  2001 Mozambique    France              1     1 1    
6  2001 Mozambique    Germany             1     0 0 
7  2002 Mozambique    France              2     0 0.33 
8  2002 Mozambique    Germany             2     1 0.333
9  2000 Uganda        France              2     1 0.5  
10  2000 Uganda        Germany             2     0 0.25 
11  2001 Uganda        France              3     1 0.286
12  2001 Uganda        Germany             3     0 0.2  
13  2002 Uganda        France              0     0 0.167
14  2002 Uganda        Germany             0     0 0.167
15  2003 Uganda        France              2     0 0    
16  2003 Uganda        Germany             2     1 0.25

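Is there a faster way to do this? I have a list of desired target_nations. Perhaps a loop could be created in which I compute one target_nation, then another, then rbind them, then another, then rbind again, and so on. Or is there a better way?

Using the runner package, you can do something like this: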

df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>% 
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 3 x 6
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1   0.5
#> 2  2001 Uganda        France              3     1   0.4
#> 3  2002 Uganda        France              3     2   0.5

You can even do it for all countries at once:


df %>% 
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 6 x 6
#> # Groups:   acquiror_nation [2]
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1 0.5  
#> 2  2000 Uganda        Germany             2     0 0    
#> 3  2001 Uganda        France              3     1 0.4  
#> 4  2001 Uganda        Germany             3     0 0    
#> 5  2002 Uganda        France              3     2 0.5  
#> 6  2002 Uganda        Germany             3     1 0.167

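Given the revised scenario, you need to do two things:

Include the argument idx = date in both sum_run calls. This corrects the output as desired, but does not include shares for missing rows/years.
To include the missing years, you need tidyr::complete, as shown below: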

param <- 'France'
df_new %>% 
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 7 x 6
date target_nation acquiror_nation    n1    n2 share
<dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              2     0 0.4  
5  2004 Uganda        France              3     1 0.2  
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
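To illustrate why indexing the window by date matters, here is a small base-R sketch that does not use runner. The helper names row_window_sum and date_window_sum are hypothetical, and the date window is assumed to cover (date - k, date], which is my reading of how a date-indexed window of length k behaves. The counts are the France rows as they stand before any missing years are filled in.

```r
# France rows without filled-in years (2003 and 2005 are absent)
date <- c(2000, 2001, 2002, 2004, 2006)
n1   <- c(2, 3, 3, 3, 2)
n2   <- c(1, 1, 2, 1, 2)

# Row-based window: current row plus the previous row, whatever its year
row_window_sum <- function(x) x + c(0, head(x, -1))

# Date-based window (assumed semantics): rows with date in (date[i] - k, date[i]]
date_window_sum <- function(x, idx, k) {
  vapply(seq_along(x),
         function(i) sum(x[idx > idx[i] - k & idx <= idx[i]]),
         numeric(1))
}

share_by_row  <- row_window_sum(n2) / row_window_sum(n1)
share_by_date <- date_window_sum(n2, date, 2) / date_window_sum(n1, date, 2)

data.frame(date, share_by_row, share_by_date)
```

With the row-based window, 2006 is averaged with 2004; with the date-based window it stands alone. The 2004 value is still not the desired 0.2 in this sketch, because the 2003 deals made by Germany are not in the France rows, which is why the missing years also have to be filled in before counting.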

As with the single-country version, you can do this for all countries at once (using group_by in place of filter):

df_new %>% 
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 14 x 6
# Groups:   acquiror_nation [2]
date target_nation acquiror_nation    n1    n2 share
<dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2000 Uganda        Germany             2     0 0    
3  2001 Uganda        France              3     1 0.4  
4  2001 Uganda        Germany             3     0 0    
5  2002 Uganda        France              3     2 0.5  
6  2002 Uganda        Germany             3     1 0.167
7  2003 Uganda        France              2     0 0.4  
8  2003 Uganda        Germany             2     1 0.4  
9  2004 Uganda        France              3     1 0.2  
10  2004 Uganda        Germany             3     1 0.4  
11  2005 Uganda        France              0     0 0.333
12  2005 Uganda        Germany             0     0 0.333
13  2006 Uganda        France              2     2 1    
14  2006 Uganda        Germany             2     0 0
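The gap-filling idea can also be sketched in base R, for readers who want to see the steps without tidyr or runner. This is an illustration under my own assumptions, not the code used above: the names years, yr and roll2 are made up, and only the columns needed for France are rebuilt from df_new.

```r
# Columns of df_new needed for the France calculation
date <- c(2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002,
          2003, 2003, 2004, 2004, 2004, 2006, 2006)
acquiror_nation <- c("France", "Germany", "France", "France", "Germany",
                     "France", "France", "Germany", "Germany", "Germany",
                     "France", "France", "Germany", "France", "France")
big_corp_TF <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,
                 TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# A factor whose levels span every year makes table() report 0 for absent years
years <- seq(min(date), max(date))
yr <- factor(date, levels = years)

n1 <- as.vector(table(yr))  # all deals per year
n2 <- as.vector(table(yr[big_corp_TF & acquiror_nation == "France"]))  # big French deals

# Years are now consecutive, so "previous row" is the same as "previous year"
roll2 <- function(x) x + c(0, head(x, -1))
share <- roll2(n2) / roll2(n1)

data.frame(date = years, n1, n2, share)
```

Once every year is present, a plain two-row window is enough. Two consecutive years with no deals at all would give 0/0 (NaN), which this sketch does not handle.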

Further edit

This is easy. Remove target_nation from nesting and add a group_by on it before complete.

Simple, isn't it?

df_new_complex %>%
mutate(d = 1) %>%
group_by(target_nation) %>%
complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
# A tibble: 16 x 6
# Groups:   acquiror_nation [2]
date target_nation acquiror_nation    n1    n2 share
<dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  1999 Mozambique    France              1     0 0    
2  1999 Mozambique    Germany             1     0 0    
3  2000 Mozambique    France              0     0 0    
4  2000 Mozambique    Germany             0     0 0    
5  2000 Uganda        France              2     1 0.5  
6  2000 Uganda        Germany             2     0 0    
7  2001 Mozambique    France              1     1 0.667
8  2001 Mozambique    Germany             1     0 0    
9  2001 Uganda        France              3     1 0.5  
10  2001 Uganda        Germany             3     0 0    
11  2002 Mozambique    France              2     0 0.2  
12  2002 Mozambique    Germany             2     1 0.2  
13  2002 Uganda        France              0     0 0    
14  2002 Uganda        Germany             0     0 0.5  
15  2003 Uganda        France              2     0 0    
16  2003 Uganda        Germany             2     1 0.5
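Note that the pipeline above groups only by acquiror_nation before the running sum, so a window can span rows from two different target nations. The following base-R sketch instead assumes that the window should stay inside each target_nation and that each target nation should get only its own range of years. It uses the split / lapply / rbind pattern that the question asks about; the function name share_for_target and the helper roll2 are hypothetical.

```r
df_new_complex <- data.frame(
  date = c(2000, 2000, 2001, 2001, 2001, 2003, 2003, 1999, 2001, 2002, 2002),
  target_nation = c(rep("Uganda", 7), rep("Mozambique", 4)),
  acquiror_nation = c("France", "Germany", "France", "France", "Germany",
                      "Germany", "Germany", "Germany", "France", "France",
                      "Germany"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
                  FALSE, TRUE, FALSE, TRUE)
)

roll2 <- function(x) x + c(0, head(x, -1))

# Hypothetical helper: 2-year share for one acquiror within one target nation,
# over that target nation's own first-to-last year
share_for_target <- function(d, acquiror) {
  years <- seq(min(d$date), max(d$date))
  yr <- factor(d$date, levels = years)
  n1 <- as.vector(table(yr))
  n2 <- as.vector(table(yr[d$big_corp_TF & d$acquiror_nation == acquiror]))
  data.frame(date = years, target_nation = d$target_nation[1],
             acquiror_nation = acquiror, n1 = n1, n2 = n2,
             share = roll2(n2) / roll2(n1))  # NaN if two empty years in a row
}

# One piece per target nation, then bind the pieces together
pieces <- lapply(split(df_new_complex, df_new_complex$target_nation),
                 share_for_target, acquiror = "France")
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because each piece is built from its own min and max year, Uganda gets no 1999 row and Mozambique gets no 2003 row. Calling the same function with acquiror = "Germany" and binding the results would cover the second acquiror.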

I noticed that you have deleted the original question.

In my solution, bigcorp_share_2years can be calculated directly even without rows for 2003 and 2005.

library(data.table)
df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany", "Germany", "Germany", "France", "France", "Germany", 
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA, 
-15L))
setDT(df_new)
# NY is the total observation number for two consecutive years.
this = 0
df_new[, NR  := .N,by = date] # NR is each group's length
df_new[, NY  := { last = this; this = last(NR); last + this }, by = date]
# special deal with single year, e.g. 2006.
df_new[, NY  := ifelse( (date - 1) %in% date, NY, NR)]
# snx: count big_corp_TF for acquiror_nation, which will be used to calculate NX
df_new[, snx := sum(big_corp_TF), by = .(date,acquiror_nation)]
# df2: remove column big_corp_TF for unique operation
df2 <- df_new[,c(1:3,5:7)][,unique(.SD)]
# roll count for two consecutive years
df2[, NX := frollsum(snx,2),by=.(acquiror_nation)]
df2[, NX := ifelse( (date - 1) %in% date, NX, snx),acquiror_nation][]
#>     date target_nation acquiror_nation NR NY snx NX
#>  1: 2000        Uganda          France  2  2   1  1
#>  2: 2000        Uganda         Germany  2  2   0  0
#>  3: 2001        Uganda          France  3  5   1  2
#>  4: 2001        Uganda         Germany  3  5   0  0
#>  5: 2002        Uganda          France  3  6   2  3
#>  6: 2002        Uganda         Germany  3  6   1  1
#>  7: 2003        Uganda         Germany  2  5   1  2
#>  8: 2004        Uganda          France  3  5   1  1
#>  9: 2004        Uganda         Germany  3  5   1  2
#> 10: 2006        Uganda          France  2  2   2  2
df2[, bigcorp_share_2years := NX/NY]
df2[, .(date,target_nation,NY,NX,bigcorp_share_2years),by=.(acquiror_nation)]
#>     acquiror_nation date target_nation NY NX bigcorp_share_2years
#>  1:          France 2000        Uganda  2  1            0.5000000
#>  2:          France 2001        Uganda  5  2            0.4000000
#>  3:          France 2002        Uganda  6  3            0.5000000
#>  4:          France 2004        Uganda  5  1            0.2000000
#>  5:          France 2006        Uganda  2  2            1.0000000
#>  6:         Germany 2000        Uganda  2  0            0.0000000
#>  7:         Germany 2001        Uganda  5  0            0.0000000
#>  8:         Germany 2002        Uganda  6  1            0.1666667
#>  9:         Germany 2003        Uganda  5  2            0.4000000
#> 10:         Germany 2004        Uganda  5  2            0.4000000

Created on 2021-05-03 by the reprex package (v2.0.0)
