r语言 - 取两个日期之间的差值 dplyr



我有以下数据:

# A tibble: 7,971 x 10
   symbol date        open  high   low close    volume adjusted start_date end_date  
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <date>     <date>    
 1 AAPL   2009-01-02  12.3  13.0  12.2  13.0 186503800     11.4 2009-07-31 2010-06-30
 2 AAPL   2009-01-05  13.3  13.7  13.2  13.5 295402100     11.8 2009-07-31 2010-06-30
 3 AAPL   2009-01-06  13.7  13.9  13.2  13.3 322327600     11.6 2009-07-31 2010-06-30
 4 AAPL   2009-01-07  13.1  13.2  12.9  13.0 188262200     11.4 2009-07-31 2010-06-30
 5 AAPL   2009-01-08  12.9  13.3  12.9  13.2 168375200     11.6 2009-07-31 2010-06-30
 6 AAPL   2009-01-09  13.3  13.3  12.9  12.9 136711400     11.3 2009-07-31 2010-06-30
 7 AAPL   2009-01-12  12.9  13.0  12.5  12.7 154429100     11.1 2009-07-31 2010-06-30
 8 AAPL   2009-01-13  12.6  12.8  12.3  12.5 199599400     11.0 2009-07-31 2010-06-30
 9 AAPL   2009-01-14  12.3  12.5  12.1  12.2 255416000     10.7 2009-07-31 2010-06-30
10 AAPL   2009-01-15  11.5  12.0  11.4  11.9 457908500     10.4 2009-07-31 2010-06-30

我正在尝试按符号、开始日期和结束日期进行分组,并取开始日期的第一个观察和结束日期的最后一个观察之间的差异。我似乎无法让它工作。

即取开始日期的"关闭"和结束日期的"关闭"之差。

任何帮助都会很棒,谢谢!

syms <- c("AAPL", "MSFT", "GOOG")
library(tidyquant)
data <- tq_get(syms)

data <- data %>%
  mutate(         start_date = paste(year(date %m+% months(6)), "07", "31", sep = "-"), # note this is the start_date for when we calculate the returns - we will have bought this portfolio on the 1st July but we get returns on the 31st
                  end_date = paste(year(date %m+% months(18)), "06", "30", sep = "-"),
                  start_date = as.Date(start_date),
                  end_date = as.Date(end_date))

我的尝试...

data %>%
  group_by(symbol, start_date, end_date) %>%
  summarise(diff = diff(close))

编辑:

我正在尝试按symbol分组,然后采取start_date - end_date.因此,首先,我应该按符号分组并将date列过滤到start_dateend_date值之间。即我只对start_date天和end_date天的"收盘价"感兴趣(这是固定的(。然后,只需取start_dateend_date的收盘价之间的差额。所以大多数股价数据在这里都是无用的,我只对start_date的收盘价感兴趣,然后end_date取这两个值之间的差值。

我认为您要查找的是减去first并为每个组last close

library(dplyr)
data %>%
  group_by(symbol, start_date, end_date) %>%
  summarise(diff = first(close) - last(close))
#   symbol start_date end_date     diff
#   <chr>  <date>     <date>      <dbl>
# 1 AAPL   2009-07-31 2010-06-30  -7.38
# 2 AAPL   2010-07-31 2011-06-30 -15.5 
# 3 AAPL   2011-07-31 2012-06-30 -12.5 
# 4 AAPL   2012-07-31 2013-06-30 -34.4 
# 5 AAPL   2013-07-31 2014-06-30  28.0 
# 6 AAPL   2014-07-31 2015-06-30 -34.5 
# 7 AAPL   2015-07-31 2016-06-30 -31.9 
# 8 AAPL   2016-07-31 2017-06-30  31   
# 9 AAPL   2017-07-31 2018-06-30 -48.1 
#10 AAPL   2018-07-31 2019-06-30 -41.6 
# … with 26 more rows

另一种写法可能是

data %>% 
  group_by(symbol, start_date, end_date) %>% 
  summarise(diff = close[1L] - close[n()])

或者也可以使用基本 R aggregate

aggregate(close~symbol +start_date + end_date,data,function(x) x[1L] - x[length(x)])

你可以采用这种方法...

# create df of unique symbol, start, and end date combos
df1 <- df %>% distinct(symbol,start_date,end_date)
# join original data that match the desired start/end dates
df1 <- df %>% select(start_close=close,symbol,start_date=date) %>% left_join(df1,.)
df1 <- df %>% select(end_close=close,symbol,end_date=date) %>% left_join(df1,.)
# find difference in close values
df1 %>% mutate(diff=end_close - start_close)
# A tibble: 36 x 6
   symbol start_date end_date   start_close end_close  diff
   <chr>  <date>     <date>           <dbl>     <dbl> <dbl>
 1 AAPL   2009-07-31 2010-06-30        23.3      35.9  12.6
 2 AAPL   2010-07-31 2011-06-30        NA        48.0  NA  
 3 AAPL   2011-07-31 2012-06-30        NA        NA    NA  
 4 AAPL   2012-07-31 2013-06-30        87.3      NA    NA  
 5 AAPL   2013-07-31 2014-06-30        64.6      92.9  28.3
 6 AAPL   2014-07-31 2015-06-30        95.6     125.   29.8
 7 AAPL   2015-07-31 2016-06-30       121.       95.6 -25.7
 8 AAPL   2016-07-31 2017-06-30        NA       144.   NA  
 9 AAPL   2017-07-31 2018-06-30       149.       NA    NA  
10 AAPL   2018-07-31 2019-06-30       190.       NA    NA  
# ... with 26 more rows

NA,因为并非每个开始/结束日期都在原始日期列中。

最新更新