通过R中的id将一组的观察转移到下一组



假设我有一个这样的数据帧:

contracts
Dates Last.Price Last.Price.1  id        carry
1  1998-11-30      94.50        98.50  QS -0.040609137
2  1998-11-30      31.32        32.13  HO -0.025210084
3  1998-12-31      95.50        98.00  QS -0.025510204
4  1998-12-31      34.00        34.28  HO -0.008168028
5  1999-01-29     100.00       100.50  QS -0.004975124
6  1999-01-29      33.16        33.42  HO -0.007779773
7  1999-02-26     100.25       100.25  QS  0.000000000
8  1999-02-26      32.29        32.37  HO -0.002471424
9  1999-02-26      10.88        11.00  CO -0.010909091
10 1999-03-31     131.50       130.75  QS  0.005736138
11 1999-03-31      44.68        44.00  HO  0.015454545
12 1999-03-31      15.24        15.16  CO  0.005277045

我想计算每个月每个id的权重。我有一个函数可以做到这一点。我使用dplyr来实现这一点:

library(dplyr)
library(lubridate)
contracts <- contracts %>%
mutate(Dates = ymd(Dates)) %>%
group_by(Dates) %>%
mutate(weights = weight(carry))

它给出:

contracts 
Dates Last.Price Last.Price.1  id        carry     weights
1  1998-11-30      94.50        98.50  QS -0.040609137 0.616979910
2  1998-11-30      31.32        32.13  HO -0.025210084 0.383020090
3  1998-12-31      95.50        98.00  QS -0.025510204 0.757468623
4  1998-12-31      34.00        34.28  HO -0.008168028 0.242531377
5  1999-01-29     100.00       100.50  QS -0.004975124 0.390056023
6  1999-01-29      33.16        33.42  HO -0.007779773 0.609943977
7  1999-02-26     100.25       100.25  QS  0.000000000          NA
8  1999-02-26      32.29        32.37  HO -0.002471424 0.184703218
9  1999-02-26      10.88        11.00  CO -0.010909091 0.815296782
10 1999-03-31     131.50       130.75  QS  0.057361377 0.057361377
11 1999-03-31      44.68        44.00  HO  0.015454545 0.015454545
12 1999-03-31      15.24        15.16  CO  0.005277045 0.005277045

现在我想要滞后权重,这样11月计算的权重就可以应用于12月。所以我基本上想按组移动权重列,组是日期。因此,11月的数值最终会变成12月的数值,以此类推

现在,我还希望偏移按id匹配,这样,如果包含新的id,则id首次出现的组将在滞后列中具有NA。

所需输出如下:

desired
Dates Last.Price Last.Price.1  id        carry     weights          w
1  1998-11-30      94.50        98.50  QS -0.040609137 0.616979910         NA
2  1998-11-30      31.32        32.13  HO -0.025210084 0.383020090         NA
3  1998-12-31      95.50        98.00  QS -0.025510204 0.757468623 0.61697991
4  1998-12-31      34.00        34.28  HO -0.008168028 0.242531377 0.38302009
5  1999-01-29     100.00       100.50  QS -0.004975124 0.390056023 0.75746862
6  1999-01-29      33.16        33.42  HO -0.007779773 0.609943977 0.24253138
7  1999-02-26     100.25       100.25  QS  0.000000000          NA 0.39005602
8  1999-02-26      32.29        32.37  HO -0.002471424 0.184703218 0.60994398
9  1999-02-26      10.88        11.00  CO -0.010909091 0.815296782         NA
10 1999-03-31     131.50       130.75  QS  0.057361377 0.057361377         NA
11 1999-03-31      44.68        44.00  HO  0.015454545 0.015454545 0.18470322
12 1999-03-31      15.24        15.16  CO  0.005277045 0.005277045 0.81529678

注意到1999年2月。CO有NA,因为它在2月份首次出现。

现在看看1999年3月,CO有来自2月的值,QS有NA只是因为2月的数值是NA(由于除以0(。

这能做到吗?

数据:


contracts  <- read.table(text = "Dates, Last.Price, Last.Price.1, id,carry
1998-11-30, 94.500, 98.500, QS, -0.0406091371
1998-11-30, 31.320, 32.130, HO, -0.0252100840
1998-12-31, 95.500, 98.000, QS, -0.0255102041
1998-12-31, 34.000, 34.280, HO, -0.0081680280
1999-01-29, 100.000, 100.500, QS, -0.0049751244
1999-01-29, 33.160, 33.420, HO, -0.0077797726
1999-02-26, 100.250, 100.250, QS,  0.0000000000
1999-02-26, 32.290, 32.370, HO, -0.0024714242
1999-02-26, 10.880, 11.000, CO, -0.0109090909
1999-03-31, 131.500, 130.750, QS,  0.0057361377
1999-03-31, 44.680, 44.000, HO,  0.0154545455
1999-03-31, 15.240, 15.160, CO,  0.0052770449", sep = ",", header = T)

desired <- read.table(text = "Dates,Last.Price,Last.Price.1,id,carry,weights,w
1998-11-30,94.5,98.5, QS,-0.0406091371,0.616979909839741,NA
1998-11-30,31.32,32.13, HO,-0.025210084,0.383020090160259,NA
1998-12-31,95.5,98, QS,-0.0255102041,0.757468623182272,0.616979909839741
1998-12-31,34,34.28, HO,-0.008168028,0.242531376817728,0.383020090160259
1999-01-29,100,100.5, QS,-0.0049751244,0.390056023188584,0.757468623182272
1999-01-29,33.16,33.42, HO,-0.0077797726,0.609943976811416,0.242531376817728
1999-02-26,100.25,100.25, QS,0,NA,0.390056023188584
1999-02-26,32.29,32.37, HO,-0.0024714242,0.184703218189261,0.609943976811416
1999-02-26,10.88,11, CO,-0.0109090909,0.815296781810739,NA
1999-03-31,131.5,130.75, QS,0.057361377,0.057361377,NA
1999-03-31,44.68,44, HO,0.0154545455,0.0154545455,0.184703218189261
1999-03-31,15.24,15.16, CO,0.0052770449,0.0052770449,0.815296782", sep = ",", header = TRUE)

权重函数:

weight <- function(vec) {
neg <- which(vec<0)
w <- vec
w[neg] <- vec[vec<0] / sum(vec[vec<0])
w[-neg] <- vec[vec>=0] / sum(vec[vec>=0])
w
}
contracts %>%
group_by(Dates) %>%
mutate(weights = weight(carry)) %>%
arrange(Dates) %>%
group_by(id) %>%
mutate(w = dplyr::lag(weights)) %>%
ungroup()
# # A tibble: 12 x 7
#    Dates      Last.Price Last.Price.1 id       carry   weights       w
#    <chr>           <dbl>        <dbl> <chr>    <dbl>     <dbl>   <dbl>
#  1 1998-11-30       94.5         98.5 " QS" -0.0406    0.617    NA    
#  2 1998-11-30       31.3         32.1 " HO" -0.0252    0.383    NA    
#  3 1998-12-31       95.5         98   " QS" -0.0255    0.757     0.617
#  4 1998-12-31       34           34.3 " HO" -0.00817   0.243     0.383
#  5 1999-01-29      100          100.  " QS" -0.00498   0.390     0.757
#  6 1999-01-29       33.2         33.4 " HO" -0.00778   0.610     0.243
#  7 1999-02-26      100.         100.  " QS"  0       NaN         0.390
#  8 1999-02-26       32.3         32.4 " HO" -0.00247   0.185     0.610
#  9 1999-02-26       10.9         11   " CO" -0.0109    0.815    NA    
# 10 1999-03-31      132.         131.  " QS"  0.00574   0.00574 NaN    
# 11 1999-03-31       44.7         44   " HO"  0.0155    0.0155    0.185
# 12 1999-03-31       15.2         15.2 " CO"  0.00528   0.00528   0.815

注:

  • 我使用了dplyr::lag而不仅仅是lag,因为可能与stats::lag混淆,后者的行为与dplyr::lag明显不同。虽然大多数时候它都会很好地工作,但它会一直工作到不。。。它通常不会警告你:-(

  • 无论月份如何,这都滞后Dates。我假设您确信Dates总是非常频繁的。如果你认为有可能存在差距(按行滞后是不正确的(,那么你需要将年/月划分为一个新的领域,并自行加入,而不是进行lag

最新更新