Evaluating time-dependent rates of return to build a pandas DataFrame



Suppose I have a pandas DataFrame like the following:

+------------+--------+
|    Date    | Price  |
+------------+--------+
| 2021-07-30 | 438.51 |
| 2021-08-02 | 437.59 |
| 2021-08-03 | 441.15 |
| 2021-08-04 | 438.98 |
+------------+--------+

The DataFrame above can be generated with the following code:

import pandas as pd

data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25

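For a given time series, I want to compute the time-dependent rate_of_return. The problem is to determine the time periods over which rate_of_return attains its best or worst values.

One could simply compute rate_of_return over all possible combinations, build a DataFrame containing period_start, period_end and rate_of_return, sort it in descending (best) or ascending (worst) order, and then exclude any periods that overlap.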

rate_of_return = ((period_end_price / period_start_price) ^ (compounding_days / days_in_between) - 1) * (normalisation_days / compounding_days)
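As a quick sanity check of the formula, here is a minimal sketch that evaluates it for a single pair of dates from the sample data. The helper name `rate_of_return` and its keyword defaults are assumptions made for illustration; they are not part of the original code.

```python
from datetime import date

def rate_of_return(start_price, end_price, start_date, end_date,
                   compounding_days=365.25, normalisation_days=365.25):
    # Number of calendar days between the two observations.
    days_in_between = (end_date - start_date).days
    # Compound the price ratio up to the compounding horizon.
    growth = (end_price / start_price) ** (compounding_days / days_in_between)
    # Rescale to the normalisation horizon (a factor of 1 when both are equal).
    return (growth - 1) * (normalisation_days / compounding_days)

ror = rate_of_return(437.59, 441.15, date(2021, 8, 2), date(2021, 8, 3))
print(round(ror, 4))  # 18.2875, the top row of the results table in the question
```

A one-day move of roughly 0.8% compounds to about 1829% when annualised, which is why very short periods dominate the ranking.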

On the DataFrame above, I calculated rate_of_return with the code below:

import numpy as np

# One column per starting row: l1 starts at row 0, l2 at row 1, l3 at row 2.
for i in range(3):
    col = f'rate_of_return_l{i + 1}'
    days = (df.Date - df.Date[i]).dt.days
    df[col] = (((df.Price / df.Price[i]) ** (compounding_days / days) - 1) *
               (normalisation_days / compounding_days))
    # Blank out the start row and every earlier row. Using .loc avoids the
    # chained assignment df[col].iloc[...] = np.nan, which may not write back.
    df.loc[:i, col] = np.nan

Based on the results, the best/worst periods are as follows:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
| 2021-08-02   | 2021-08-04 |    0.784586925 |
| 2021-07-30   | 2021-08-03 |    0.729942907 |
| 2021-07-30   | 2021-08-04 |    0.081397181 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
+--------------+------------+----------------+

Expected output

If I want to see the best rate_of_return, the resulting DataFrame would be:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
+--------------+------------+----------------+

If I want to see the worst rate_of_return, the resulting DataFrame would be:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
+--------------+------------+----------------+
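  • What is the best way to test every scenario when calculating rate_of_return?
  • How can I get the expected output so that the periods do not overlap? (See the expected output.)
  • The best/worst DataFrames are not sign dependent: as long as no periods overlap, the best DataFrame may contain negative rate_of_return values.
  • If the formula changes to (period_end_price/period_start_price) - 1 (that is, no time dependence), what would the approach be?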

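If I understand correctly, your question has two parts:

Part 1: Generating the combinations

To generate the combinations you can use itertools, compute the rate of return for each combination, and sort the results.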

from itertools import combinations

rors = []
for combination in combinations(zip(df['Date'], df['Price']), 2):
    (start_date, start_price), (end_date, end_price) = combination
    # normalisation_days / compounding_days equals 1 here, so that factor is omitted.
    ror = (end_price / start_price) ** (compounding_days / (end_date - start_date).days) - 1
    rors.append((start_date, end_date, ror))

# Highest rate of return first.
sorted_rors = sorted(rors, key=lambda x: x[2], reverse=True)
print(sorted_rors[0])
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)
print(sorted_rors[-1])
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)
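If the series is long, the Python-level loop over every pair may become slow. As a hedged alternative (not part of the original answer), the same pairs can be produced with NumPy index arrays. This is only a sketch: it assumes the rows are already sorted by date, and it still materialises all n*(n-1)/2 pairs in memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2021-07-30', '2021-08-02',
                                           '2021-08-03', '2021-08-04']),
                   'Price': [438.51, 437.59, 441.15, 438.98]})
normalisation_days = 365.25
compounding_days = 365.25

# Index pairs (i, j) with i < j, i.e. every start row paired with each later end row.
i, j = np.triu_indices(len(df), k=1)
dates = df['Date'].values
prices = df['Price'].values

# Elapsed time in days as floats, and the price ratio, for every pair at once.
days = (dates[j] - dates[i]) / np.timedelta64(1, 'D')
ratio = prices[j] / prices[i]
ror = (ratio ** (compounding_days / days) - 1) * (normalisation_days / compounding_days)

result = pd.DataFrame({'Period Start': dates[i],
                       'Period End': dates[j],
                       'Rate of Return': ror})
result = result.sort_values('Rate of Return', ascending=False, ignore_index=True)
print(result)
```

The first and last rows of `result` correspond to `sorted_rors[0]` and `sorted_rors[-1]` from the loop version.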

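Part 2: Non-overlapping periods

I am less sure about this part, but my guess is that you are trying to find the top n returns whose periods do not overlap. If the number of periods you are looking at is large, you could consider using a generator function: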

def next_non_overlapping(iterable):
    """Yield each period that overlaps none of the periods already yielded."""
    kept = []
    for start, end, ror in iterable:
        # Compare against every kept period, not only the most recent one.
        # Periods that merely share an endpoint are treated as non-overlapping.
        if all(start >= kept_end or end <= kept_start
               for kept_start, kept_end in kept):
            kept.append((start, end))
            yield (start, end, ror)

nno = next_non_overlapping(sorted_rors)
print(next(nno))
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)
print(next(nno))
#(Timestamp('2021-07-30 00:00:00'),
# Timestamp('2021-08-02 00:00:00'),
# -0.22562691374181088)
print(next(nno))
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)
# Pass a default so that an exhausted generator does not raise StopIteration.
print(next(nno, "No more items"))
# No more items

To get the n lowest returns instead, you can simply pass the reversed list to the function, i.e.

nnor = next_non_overlapping(reversed(sorted_rors))
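To collect the first n non-overlapping periods in one go rather than calling next() repeatedly, itertools.islice can cap the generator. The sketch below is self-contained: it restates the generator in a variant that checks every period already kept, and it hard-codes the six periods from the question's results table using plain `date` objects, which is an assumption made so the snippet runs without pandas.

```python
from datetime import date
from itertools import islice

def next_non_overlapping(periods):
    # Yield each period that overlaps none of the periods already yielded.
    kept = []
    for start, end, ror in periods:
        if all(start >= k_end or end <= k_start for k_start, k_end in kept):
            kept.append((start, end))
            yield (start, end, ror)

# Periods from the question, already sorted from best to worst.
sorted_rors = [
    (date(2021, 8, 2), date(2021, 8, 3), 18.28751739),
    (date(2021, 8, 2), date(2021, 8, 4), 0.784586925),
    (date(2021, 7, 30), date(2021, 8, 3), 0.729942907),
    (date(2021, 7, 30), date(2021, 8, 4), 0.081397181),
    (date(2021, 7, 30), date(2021, 8, 2), -0.225626914),
    (date(2021, 8, 3), date(2021, 8, 4), -0.834880227),
]

n = 2
best_n = list(islice(next_non_overlapping(sorted_rors), n))
worst_n = list(islice(next_non_overlapping(reversed(sorted_rors)), n))
print(best_n)
print(worst_n)
```

With n = 2, `worst_n` contains the 2021-08-03 to 2021-08-04 period followed by 2021-07-30 to 2021-08-02, which matches the worst-case table in the question.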

First, the problem would be easier to solve if the time series were daily, so I would do this:

df.set_index('Date').resample('D').mean().reset_index()

This gives us:

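+---+------------+--------+
|   |    Date    | Price  |
+---+------------+--------+
| 0 | 2021-07-30 | 438.51 |
| 1 | 2021-07-31 |    NaN |
| 2 | 2021-08-01 |    NaN |
| 3 | 2021-08-02 | 437.59 |
| 4 | 2021-08-03 | 441.15 |
| 5 | 2021-08-04 | 438.98 |
+---+------------+--------+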

Define your function so that you can pass it the DataFrame together with the start and end dates directly:

import numpy as np
import pandas as pd

data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25
columns = ['Rate of Return', 'Period Start', 'Period End']

def rate_ret(df, start_date, end_date):
    start = df[df.Date == start_date].iloc[0]
    end = df[df.Date == end_date].iloc[0]
    days_in_between = (end.Date - start.Date).days
    return (((end.Price / start.Price) ** (compounding_days / days_in_between) - 1)
            * (normalisation_days / compounding_days))

# Iterate over all possible date intervals, building a list of rows.
# The inner loop only visits dates later than the start date.
array = []
for start_date in df.Date:
    for end_date in df.Date[df.Date > start_date]:
        array.append([rate_ret(df, start_date, end_date), start_date, end_date])
print(array)

# To extract the best and the worst periods without overlap, walk the sorted
# rows and keep a row only if it collides with none of the rows kept so far.
# The comparisons are strict, so periods sharing an endpoint count as overlapping.
def extract_non_overlaping(df):
    saved_rows = []
    for _, row in df.iterrows():
        if all((row['Period End'] < saved['Period Start']) or
               (row['Period Start'] > saved['Period End'])
               for saved in saved_rows):
            saved_rows.append(row)
    return pd.DataFrame(saved_rows, columns=columns)

df_higher = pd.DataFrame(array, columns=columns).sort_values('Rate of Return', ascending=False)
df_lower = pd.DataFrame(array, columns=columns).sort_values('Rate of Return')
print(extract_non_overlaping(df_higher))
print(extract_non_overlaping(df_lower))

Result for df_higher (best):

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
+--------------+------------+----------------+

Result for df_lower (worst):

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
+--------------+------------+----------------+

If the formula does not depend on time, just change the formula inside the rate_ret definition.
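For illustration, a minimal sketch of that time-independent variant, using the formula (period_end_price/period_start_price) - 1 from the question. The helper name `simple_return` is hypothetical, and dates are kept as plain ISO strings so the snippet runs without pandas.

```python
from itertools import combinations

dates = ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']
prices = [438.51, 437.59, 441.15, 438.98]

def simple_return(start_price, end_price):
    # Time-independent return: no compounding and no normalisation.
    return end_price / start_price - 1

# Every (start, end) pair with the start before the end, as in the time-dependent case.
simple_rors = [(start_date, end_date, simple_return(start_price, end_price))
               for (start_date, start_price), (end_date, end_price)
               in combinations(zip(dates, prices), 2)]
simple_rors.sort(key=lambda r: r[2], reverse=True)

for row in simple_rors:
    print(row)
```

On this sample the best and worst periods are the same as with the annualised formula, but the ordering in between differs, because short periods are no longer amplified by compounding. The non-overlap filtering step is unaffected by the change of formula.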

P.S.: there is room for some optimisation, but overall the code works.
