How do I compute the average return over separate contiguous ranges determined by another column in Python?



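I currently have a Pandas DataFrame containing a time series of an asset's price and a column containing a "state". There are three states, -1, 0, and 1, which occur at different points in the data.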

I am trying to find the asset's average return within each of these states, ideally with a vectorized approach.

Here is an example of the DataFrame:

| Timestamp                        | mid_price | state |
|----------------------------------|-----------|-------|
| 2020-01-01 00:05:25+00:00        | 528.50    | 0     |
| 2020-01-01 00:05:25.500000+00:00 | 529.00    | 0     |
| 2020-01-01 00:05:26+00:00        | 527.50    | 1     |
| 2020-01-01 00:05:26.500000+00:00 | 528.00    | 1     |
| 2020-01-01 00:05:27+00:00        | 529.00    | 1     |
| 2020-01-01 00:05:27.500000+00:00 | 531.50    | -1    |
| 2020-01-01 00:05:28+00:00        | 531.50    | -1    |
| 2020-01-01 00:05:28.500000+00:00 | 532.50    | 1     |
| 2020-01-01 00:05:29+00:00        | 532.50    | 1     |
| 2020-01-01 00:05:29.500000+00:00 | 530.00    | 1     |

I am trying to compute the average return for each state, so for example, for state 1:

First sequence: (529 - 527.50) / 527.50 = 0.0028...
Second sequence: (530 - 532.50) / 532.50 = -0.0047...
Average return: (0.0028 - 0.0047) / 2 = -0.00095
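
As a quick check of the arithmetic above, here is a minimal sketch that recomputes the state 1 example without intermediate rounding. The variable names are illustrative only. Unrounded, the average comes out at about -0.000926; the -0.00095 above comes from rounding the two returns before averaging.

```python
# Return of each state-1 sequence: (last price - first price) / first price
first_seq = (529.00 - 527.50) / 527.50
second_seq = (530.00 - 532.50) / 532.50

# Average of the two sequence returns, without intermediate rounding
average = (first_seq + second_seq) / 2

print(round(first_seq, 6), round(second_seq, 6), round(average, 6))
```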

Is there a concise, vectorized way to do this?

Create an ID for each sequence and use groupby:

df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each contiguous sequence
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # return of each sequence: (last - first) / first
).groupby('state').mid_price.mean()  # average return for each state

The Seq column is computed so that all rows belonging to the same sequence share one ID:

                         Timestamp  mid_price  state  Seq
0        2020-01-01 00:05:25+00:00      528.5      0    1
1 2020-01-01 00:05:25.500000+00:00      529.0      0    1
2        2020-01-01 00:05:26+00:00      527.5      1    2
3 2020-01-01 00:05:26.500000+00:00      528.0      1    2
4        2020-01-01 00:05:27+00:00      529.0      1    2
5 2020-01-01 00:05:27.500000+00:00      531.5     -1    3
6        2020-01-01 00:05:28+00:00      531.5     -1    3
7 2020-01-01 00:05:28.500000+00:00      532.5      1    4
8        2020-01-01 00:05:29+00:00      532.5      1    4
9 2020-01-01 00:05:29.500000+00:00      530.0      1    4

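The main idea is to use diff to find the elements of the state column where the state changes: if a state differs from the one in the previous row, the difference is nonzero. You can then use cumsum, the cumulative sum, to turn those change points into an incrementing ID. This also works for the first row, because diff returns NaN for the first element, and NaN compares as not equal to zero.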
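
To make the diff and cumsum steps concrete, here is a minimal sketch that lays out each intermediate result side by side for the example states. The `steps` frame and the `seq_alt` variant are illustrative additions, not part of the original answer. The `seq_alt` line compares against `shift()` instead of using `diff()`, which should give the same IDs and would also work if the states were not numeric.

```python
import pandas as pd

# The state column from the example DataFrame
state = pd.Series([0, 0, 1, 1, 1, -1, -1, 1, 1, 1], name="state")

steps = pd.DataFrame({
    "state": state,
    "diff": state.diff(),                 # NaN for the first row, nonzero at each change
    "changed": state.diff() != 0,         # True where a new sequence starts
    "Seq": (state.diff() != 0).cumsum(),  # running count of sequence starts
})
print(steps)

# Illustrative alternative: compare each row with the previous one directly
seq_alt = state.ne(state.shift()).cumsum()
```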

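Once you have the Seq column, the solution is straightforward: first group by Seq and state to get the return of each sequence, then compute the average return for each state.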
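
The two grouping steps can also be written out separately, which makes the per-sequence returns visible before they are averaged. This is a sketch of a variant, not the answer's exact code: it swaps the lambda for the built-in `first` and `last` aggregations, and the names `per_seq`, `ret`, and `avg_by_state` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "mid_price": [528.5, 529.0, 527.5, 528.0, 529.0, 531.5, 531.5, 532.5, 532.5, 530.0],
    "state": [0, 0, 1, 1, 1, -1, -1, 1, 1, 1],
})
df["Seq"] = (df["state"].diff() != 0).cumsum()

# Step 1: one row per sequence, with its first and last price
per_seq = df.groupby(["Seq", "state"])["mid_price"].agg(["first", "last"])
per_seq["ret"] = (per_seq["last"] - per_seq["first"]) / per_seq["first"]

# Step 2: average the per-sequence returns within each state
avg_by_state = per_seq.groupby(level="state")["ret"].mean()

print(per_seq)
print(avg_by_state)
```

Avoiding the Python-level lambda keeps both steps inside pandas' built-in aggregations, which may matter on a long tick series.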

Here is the complete code and the result:
import pandas as pd
df = pd.DataFrame(
    columns=["Timestamp", "mid_price", "state"],
    data=[
        ["2020-01-01 00:05:25+00:00", 528.50, 0],
        ["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
        ["2020-01-01 00:05:26+00:00", 527.50, 1],
        ["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
        ["2020-01-01 00:05:27+00:00", 529.00, 1],
        ["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
        ["2020-01-01 00:05:28+00:00", 531.50, -1],
        ["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
        ["2020-01-01 00:05:29+00:00", 532.50, 1],
        ["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
    ],
)
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each contiguous sequence
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # return of each sequence: (last - first) / first
).groupby('state').mid_price.mean()  # average return for each state
state
-1    0.000000
 0    0.000946
 1   -0.000926
Name: mid_price, dtype: float64
