Rolling correlation across multiple columns in pandas



I am trying to calculate and visualize the rolling correlation between multiple columns over a 180-day window (3 days in this example).

My data looks like this (the original file has 12 columns plus the timestamp, and thousands of rows):

import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
    Timestamp  Austria  Belgium  France
1  1993-11-01     6.18     7.05    7.69
2  1993-11-02     6.18     7.05    7.61
3  1993-11-03     6.17     7.20    7.67
4  1993-11-04     6.17     7.50    7.91
5  1993-11-15     6.40     7.60    8.61

I can't simply call the following, because the Timestamp column causes a conversion error:

df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'

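When I drop the Timestamp column, almost every cell comes out as 1.0, which is not right either. I also lose the timestamps that I need for the final plot.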

df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)

   Austria  Belgium  France
1      NaN      NaN     NaN
2      NaN      NaN     1.0
3      1.0      1.0     1.0
4     -inf      1.0     1.0
5      1.0      1.0     1.0

Does anyone have experience computing a rolling correlation across multiple columns while keeping the date index?
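Two things appear to combine in the attempt above. Passing the frame itself as `other` makes pandas correlate each column with its same-named column, that is, with itself. In addition, a window of only two observations can yield nothing but +1, -1, or an undefined value. A small sketch with made-up numbers (`toy`, `a`, and `b` are illustrative, not the question's data):

```python
import numpy as np
import pandas as pd

# Made-up values; neither column is constant within any window.
toy = pd.DataFrame({"a": [1.0, 3.0, 2.0, 5.0, 4.0],
                    "b": [2.0, 1.0, 4.0, 6.0, 3.0]})

# Frame against itself: each column is correlated with itself,
# so every defined value is 1.
self_corr = toy.rolling(3).corr(toy)
print(self_corr)

# Two different columns, but only two points per window:
# the result can only be +1 or -1.
two_point = toy["a"].rolling(2).corr(toy["b"])
print(two_point)
```

Correlating distinct pairs of columns over a window of at least three rows avoids both effects.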

Based on Shryans Jain's answer, I suggest the following. It should work for any number of columns:

import itertools as it

# omit the timestamp column
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
res = pd.DataFrame()
for pair in col_pairs:
    # build the name from the first three letters of each column in the pair
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    # parentheses allow the method chain to span several lines
    res[corr_name] = (
        df[list(pair)]
        .rolling(min_periods=1, window=3)
        .corr()
        .iloc[0::2, -1]
        .reset_index(drop=True)
    )
print(str(res))
Aus_Bel_corr  Aus_Fra_corr  Bel_Fra_corr
0           NaN           NaN           NaN
1           NaN           NaN           NaN
2     -1.000000     -0.277350      0.277350
3     -0.755929     -0.654654      0.989743
4      0.693375      0.969346      0.849167

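The NaN values at the beginning are a result of the windowing: the first window holds a single row, and in the second Austria and Belgium are each constant, so no correlation is defined.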

Update: I uploaded a notebook that explains in detail what happens inside the loop.

https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
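To see what the `.iloc[0::2, -1]` step in the loop selects, it may help to look at the intermediate result for a single pair. The sketch below rebuilds two columns of the sample frame from the question; `pair_corr` and `aus_bel` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03",
                      "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
    },
    index=[1, 2, 3, 4, 5],
)

# Rolling corr on a two-column frame returns one 2x2 matrix per row,
# stacked under a MultiIndex of (original row label, column name).
pair_corr = df[["Austria", "Belgium"]].rolling(window=3, min_periods=1).corr()
print(pair_corr.head(6))

# Every second row (the "Austria" rows) combined with the last column
# ("Belgium") is the Austria-Belgium correlation for each window.
aus_bel = pair_corr.iloc[0::2, -1]
print(aus_bel)
```

The diagonal entries of each 2x2 block are the self-correlations, which is why only the off-diagonal element is kept.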

You could compute the correlations pair by pair like this, instead of computing all three at once.

Once you have the correlations, you can add them directly as columns, which keeps the timestamps.

# note: 'Finland' is not in the sample frame above; substitute a column that exists there (e.g. 'France')
df['Aus_Bel_corr'] = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Bel_Fin_corr'] = df[['Belgium', 'Finland']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Aus_Fin_corr'] = df[['Austria', 'Finland']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
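One caveat, assuming the frame keeps the non-default index `[1, 2, 3, 4, 5]` from the question: `reset_index(drop=True)` relabels the result 0 to 4, and column assignment aligns on labels, so the values land one row off. A possible variant drops only the column-name level of the MultiIndex instead, which keeps the original row labels. This is a sketch using two of the sample columns, not the answerer's code; `pair_corr` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03",
                      "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
    },
    index=[1, 2, 3, 4, 5],
)

pair_corr = df[["Austria", "Belgium"]].rolling(window=3, min_periods=1).corr()

# Keep the "Austria" rows / "Belgium" column, then drop the inner
# (column-name) index level so the labels match df.index again.
df["Aus_Bel_corr"] = pair_corr.iloc[0::2, -1].droplevel(1)
print(df[["Timestamp", "Aus_Bel_corr"]])
```

With a default 0-based index, both variants should give the same column.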

I think there is another way.

# parentheses allow the method chain to span several lines
df['Aus_Bel_corr'] = (
    df['Austria']
    .rolling(min_periods=1, window=3)
    .corr(df['Belgium'])
)

To me, this seems a bit simpler than the previous answers.
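The same Series-level call can be looped over every column pair. The sketch below also parses `Timestamp` and uses it as the index, so the result keeps its dates for plotting. The names `ts` and `res` are illustrative, and the 3-row window stands in for the 180-day one.

```python
import itertools as it

import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03",
                      "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)

# Use parsed timestamps as the index so every result row carries its date.
ts = df.assign(Timestamp=pd.to_datetime(df["Timestamp"])).set_index("Timestamp")

res = pd.DataFrame(index=ts.index)
for left, right in it.combinations(ts.columns, 2):
    # Series.rolling(...).corr(other) returns a Series on the same index.
    res[f"{left[:3]}_{right[:3]}_corr"] = (
        ts[left].rolling(window=3, min_periods=1).corr(ts[right])
    )
print(res)
```

Calling `res.plot()` afterwards should draw one line per pair against the dates, assuming matplotlib is installed.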
