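I have a Pandas DataFrame that contains a time series of asset prices and a column holding a "state". There are three states, -1, 0 and 1, which occur at different points in the data.
I am trying to find the average return of the asset within each of these states, ideally using a vectorized approach.
Here is an example of the DataFrame:

| Timestamp | mid_price | state |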
|----------------------------------|-----------|-------|
| 2020-01-01 00:05:25+00:00 | 528.50 | 0 |
| 2020-01-01 00:05:25.500000+00:00 | 529.00 | 0 |
| 2020-01-01 00:05:26+00:00 | 527.50 | 1 |
| 2020-01-01 00:05:26.500000+00:00 | 528.00 | 1 |
| 2020-01-01 00:05:27+00:00 | 529.00 | 1 |
| 2020-01-01 00:05:27.500000+00:00 | 531.50 | -1 |
| 2020-01-01 00:05:28+00:00 | 531.50 | -1 |
| 2020-01-01 00:05:28.500000+00:00 | 532.50 | 1 |
| 2020-01-01 00:05:29+00:00 | 532.50 | 1 |
| 2020-01-01 00:05:29.500000+00:00 | 530.00 | 1 |
I want to compute the average return for each state. For example, for state 1:
First sequence: (529 - 527.50) / 527.50 = 0.0028...
Second sequence: (530 - 532.50) / 532.50 = -0.0047...
Average return: (0.00284 - 0.00469) / 2 ≈ -0.0009
Is there a concise, vectorized way to do this?
Create an ID for each sequence and use groupby:
df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each sequence
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute the return of each sequence
).groupby('state').mid_price.mean()  # compute the average return for each state
The Seq column is computed so that all rows belonging to the same sequence share the same ID:
Timestamp mid_price state Seq
0 2020-01-01 00:05:25+00:00 528.5 0 1
1 2020-01-01 00:05:25.500000+00:00 529.0 0 1
2 2020-01-01 00:05:26+00:00 527.5 1 2
3 2020-01-01 00:05:26.500000+00:00 528.0 1 2
4 2020-01-01 00:05:27+00:00 529.0 1 2
5 2020-01-01 00:05:27.500000+00:00 531.5 -1 3
6 2020-01-01 00:05:28+00:00 531.5 -1 3
7 2020-01-01 00:05:28.500000+00:00 532.5 1 4
8 2020-01-01 00:05:29+00:00 532.5 1 4
9 2020-01-01 00:05:29.500000+00:00 530.0 1 4
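The main idea is to use diff to find the rows of the state column where the value changes: if the state differs from the one in the previous row, the difference is nonzero. You can then use cumsum, the cumulative sum, to turn those change points into an incrementing ID. This also works for the first row, because the first element returned by diff is NaN, which is not equal to zero.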
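To make the intermediate steps visible, the following is a small sketch that lays out the result of diff, the comparison with zero, and the cumulative sum side by side. The state values are copied from the example table; the names `steps` and `changed` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# State column copied from the example DataFrame
state = pd.Series([0, 0, 1, 1, 1, -1, -1, 1, 1, 1], name="state")

steps = pd.DataFrame({
    "state": state,
    "diff": state.diff(),                 # NaN for the first row, 0 where the state is unchanged
    "changed": state.diff() != 0,         # True at the first row of every sequence
    "Seq": (state.diff() != 0).cumsum(),  # running count of change points = sequence ID
})
print(steps)
```

Because True counts as 1 in the cumulative sum, every row where `changed` is True starts a new ID, and the rows after it keep that ID until the next change.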
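Once you have the Seq column, the solution to your problem is straightforward: first group by state and Seq to get the return of each sequence, then compute the average return for each state.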
import pandas as pd
df = pd.DataFrame(
    columns=["Timestamp", "mid_price", "state"],
    data=[
        ["2020-01-01 00:05:25+00:00", 528.50, 0],
        ["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
        ["2020-01-01 00:05:26+00:00", 527.50, 1],
        ["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
        ["2020-01-01 00:05:27+00:00", 529.00, 1],
        ["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
        ["2020-01-01 00:05:28+00:00", 531.50, -1],
        ["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
        ["2020-01-01 00:05:29+00:00", 532.50, 1],
        ["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
    ],
)
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute the return of each sequence
).groupby('state').mid_price.mean()  # compute the average return for each state
state
-1 0.000000
0 0.000946
1 -0.000926
Name: mid_price, dtype: float64
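As a possible variation, not part of the original answer, the Python lambda can be replaced with the built-in `first` and `last` group aggregations, which may be faster on large frames. This is a sketch under the assumption that mid_price contains no missing values (`first` and `last` skip NaN by default); the names `prices`, `seq_returns` and `avg_returns` are illustrative.

```python
import pandas as pd

# Prices and states copied from the example DataFrame
df = pd.DataFrame({
    "mid_price": [528.5, 529.0, 527.5, 528.0, 529.0, 531.5, 531.5, 532.5, 532.5, 530.0],
    "state": [0, 0, 1, 1, 1, -1, -1, 1, 1, 1],
})
df["Seq"] = (df["state"].diff() != 0).cumsum()  # ID for each sequence

prices = df.groupby(["Seq", "state"])["mid_price"]
seq_returns = prices.last() / prices.first() - 1          # return of each sequence
avg_returns = seq_returns.groupby(level="state").mean()   # average return per state
print(avg_returns)
```

The result should agree with the lambda-based version above, since `last / first - 1` is the same quantity as `(last - first) / first`.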