Sliding index over a DataFrame



I have the following DataFrame with a datetime index:

A
date      
2020-01  1
2020-01  2
2020-02  3
2020-02  4
2020-03  5
2020-03  6
2020-04  7
2020-04  8

I want to write a for loop that returns new DataFrames (until the end of the data), so that the result is:

DataFrame 1

A
date      
2020-01  1
2020-01  2
2020-02  3
2020-02  4

DataFrame 2

2020-02  3
2020-02  4
2020-03  5
2020-03  6

DataFrame 3

2020-03  5
2020-03  6
2020-04  7
2020-04  8

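The idea is an "add and drop" rolling window. The logic is:

  • The first two months give DataFrame 1
  • Drop the first month and add one new month to get DataFrame 2
  • Continue until the end of the data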

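I found this useful, but I don't know how to implement it properly. I have tried it, and when I apply rolling, the values of each month get aggregated. I want to keep the original values!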

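Also, if I use a simple for loop like

for i in range(len(df)):
    print(df[i : i+n])

I can get my result based on the length of the DataFrame. But how can it be done based on months?

Any suggestions would be much appreciated, thanks!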

You can try this:

import pandas as pd

d = {'date': ['2020-01', '2020-01', '2020-02', '2020-02', '2020-03', '2020-03', '2020-04', '2020-04'], 'A': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')

result = []
for date in df.date.unique():
    # keep the rows from this month through the following month (inclusive)
    result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=1))])

Output:

for x in result:
    print(x)

date  A
0 2020-01-01  1
1 2020-01-01  2
2 2020-02-01  3
3 2020-02-01  4
date  A
2 2020-02-01  3
3 2020-02-01  4
4 2020-03-01  5
5 2020-03-01  6
date  A
4 2020-03-01  5
5 2020-03-01  6
6 2020-04-01  7
7 2020-04-01  8
date  A
6 2020-04-01  7
7 2020-04-01  8
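
Note that the last frame in this output holds only April, because no later month exists in the data. If only complete two-month windows are wanted, one possible sketch is to filter the list afterwards. The name `full` and the check on the number of distinct months are illustrative assumptions, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01', '2020-01', '2020-02', '2020-02',
                            '2020-03', '2020-03', '2020-04', '2020-04'],
                           format='%Y-%m'),
    'A': [1, 2, 3, 4, 5, 6, 7, 8],
})

# Same windows as in the answer above: each start month plus the following one.
result = []
for date in df['date'].unique():
    start = pd.Timestamp(date)
    end = start + pd.DateOffset(months=1)
    result.append(df[(df['date'] >= start) & (df['date'] <= end)])

# Assumption: a window counts as complete when it covers two distinct months.
full = [x for x in result if x['date'].dt.to_period('M').nunique() == 2]

for x in full:
    print(x)
```

With the sample data this drops the trailing April-only frame and leaves the three frames asked for in the question.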

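OK, I see what you want! Try this:

import pandas as pd
import numpy as np

d = {'date': ['2020-01', '2020-01', '2020-02', '2020-02', '2020-03', '2020-03', '2020-04', '2020-04'], 'A': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')

result = []
for i, date in enumerate(df.date.unique()):
    if i == 0:
        # first frame: from the first date up to two months later (inclusive)
        result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=2))])
    else:
        try:
            # later frames: strictly after the last date of the previous frame
            prev_last = result[i-1].iloc[-1, :].date
            result.append(df[(df['date'] > prev_last) & (df['date'] <= prev_last + pd.DateOffset(months=2))])
        except IndexError:
            # the previous frame was empty or missing, so the data is exhausted
            pass
result = [i for i in result if not i.empty]
for res in result:
    print(res)

Here I added a date condition based on the previous df: each new frame only takes dates that are > the last date of the previous DataFrame.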

df
Out[248]: 
date  A
0 2020-01-01  1
1 2020-01-01  2
2 2020-02-01  3
3 2020-02-01  4
4 2020-03-01  5
5 2020-03-01  6
6 2020-04-01  7
7 2020-04-01  8

for res in result:
    print(res)
date  A
0 2020-01-01  1
1 2020-01-01  2
2 2020-02-01  3
3 2020-02-01  4
4 2020-03-01  5
5 2020-03-01  6
date  A
6 2020-04-01  7
7 2020-04-01  8

Edit:

result = []
for i, date in enumerate(df.date.unique()):
    if i == 0:
        # choose your time period here (14 months for the first df, as in your example)
        result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=14))])
    else:
        try:
            # for the other dfs, take all rows between the first date of the previous
            # DataFrame + 3 months and the first date of the previous DataFrame + 14 + 3 months
            prev_first = result[i-1].iloc[0, :].date
            result.append(df[(df['date'] > prev_first + pd.DateOffset(months=3)) & (df['date'] <= prev_first + pd.DateOffset(months=17))])
        except IndexError:
            # the previous frame was empty or missing, so the data is exhausted
            pass
result = [i for i in result if not i.empty]
for res in result:
    print(res)
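
The edit above hard-codes a window length and a step in months. A minimal sketch of the same idea with both as parameters might look like the following. The function name `month_windows` and its arguments `window` and `step` are hypothetical, and the sketch works on calendar months via `Period` objects rather than on the `DateOffset` comparisons used above, so the exact window boundaries differ from that code:

```python
import pandas as pd

def month_windows(df, window, step=1):
    """Yield slices of df covering `window` calendar months, moving `step` months each time.

    Assumes df has a datetime column named 'date' (as in the answer above).
    """
    months = df['date'].dt.to_period('M')   # month label of every row
    start, last = months.min(), months.max()
    # stop once a full window would run past the last month in the data
    while start + window - 1 <= last:
        end = start + window - 1
        yield df[(months >= start) & (months <= end)]
        start += step

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01', '2020-01', '2020-02', '2020-02',
                            '2020-03', '2020-03', '2020-04', '2020-04'],
                           format='%Y-%m'),
    'A': [1, 2, 3, 4, 5, 6, 7, 8],
})

frames = list(month_windows(df, window=2, step=1))
for frame in frames:
    print(frame)
```

With `window=2, step=1` this gives the overlapping frames from the question; `step=2` would give non-overlapping ones.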

IIUC, let's use rolling index values to create a dictionary of DataFrames, like this:

import pandas as pd
import numpy as np
from io import StringIO
# Create and read in dummy dataframe
txtfile = StringIO("""        A
date      
2020-01  1
2020-01  2
2020-02  3
2020-02  4
2020-03  5
2020-03  6
2020-04  7
2020-04  8
2020-05  9
""")
df = pd.read_csv(txtfile, index_col=0, sep=r'\s\s+', engine='python')
# Use `pd.factorize` to create the position of each of the periods
pos, ldates = pd.factorize(df.index)

# Create pd.Series from a unique list of those positions
s = pd.Series(np.arange(pos.max()+1))
# Use this rolling-apply trick to create a sliding window of indexes positions
l_of_idx = []
s.rolling(2).apply(lambda x: l_of_idx.append(x.astype(int).tolist()) or 0)
# Note: the 2 in rolling indicates two months in this example
# Create a dictionary of dataframes for each of those windows
d_dfs = {n:df.loc[ldates[i]] for n,i in enumerate(l_of_idx)}
#Output each dataframe:
print('First dataframe:')
print(d_dfs[0])
print('\nSecond dataframe')
print(d_dfs[1])
print('\nThird dataframe')
print(d_dfs[2])
print('\nFourth dataframe')
print(d_dfs[3])

Output of the dictionary values:

First dataframe:
A
date      
2020-01  1
2020-01  2
2020-02  3
2020-02  4
Second dataframe
A
date      
2020-02  3
2020-02  4
2020-03  5
2020-03  6
Third dataframe
A
date      
2020-03  5
2020-03  6
2020-04  7
2020-04  8
Fourth dataframe
A
date      
2020-04  7
2020-04  8
2020-05  9
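
The rolling-apply trick above relies on a side effect inside the lambda. As a rough alternative sketch, the positions returned by `pd.factorize` can be compared directly, assuming the index is already sorted by month. The DataFrame is rebuilt inline here instead of being read from text, and the name `window` is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': range(1, 10)},
    index=pd.Index(['2020-01', '2020-01', '2020-02', '2020-02', '2020-03',
                    '2020-03', '2020-04', '2020-04', '2020-05'], name='date'),
)

window = 2  # number of months per frame

# pos holds, for every row, the position of its month among the unique months
pos, ldates = pd.factorize(df.index)

# one frame per starting month, as long as a full window still fits
d_dfs = {
    n: df[(pos >= n) & (pos < n + window)]
    for n in range(len(ldates) - window + 1)
}

for n, frame in d_dfs.items():
    print(n)
    print(frame)
```

This builds the same kind of dictionary as above, keyed by window number, without calling `rolling`.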

Here is another approach. I used @ScottBoston's definition of df and then converted the index to datetime:

df.index = pd.to_datetime(df.index)

Then I created a PeriodIndex for the whole date range, zipped the PeriodIndex with an offset copy of itself (to get the start and end dates), and stored the results in a list.

# create unique sorted list of periods in the data set
idx = pd.period_range(start=df.index.min(), end=df.index.max(), freq='M')
months_in_window = 2
d_dfs = list()
# extract window from original data frame
for start, end in zip(idx, idx[months_in_window - 1:]):
    d_dfs.append(df[start.start_time : end.end_time])

descs = ['First dataframe:', 'Second dataframe:',
         'Third dataframe:', 'Fourth dataframe:']
for desc, d_df in zip(descs, d_dfs):
    print(desc)
    print(d_df, end='\n\n')

First dataframe:
A
date         
2020-01-01  1
2020-01-01  2
2020-02-01  3
2020-02-01  4
Second dataframe:
A
date         
2020-02-01  3
2020-02-01  4
2020-03-01  5
2020-03-01  6
Third dataframe:
A
date         
2020-03-01  5
2020-03-01  6
2020-04-01  7
2020-04-01  8
Fourth dataframe:
A
date         
2020-04-01  7
2020-04-01  8
2020-05-01  9
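
To see what zipping the PeriodIndex with an offset copy of itself produces, the pairs can be inspected on their own. This is only an illustrative sketch; the date range is typed in by hand to match the sample data, and the name `pairs` is an assumption:

```python
import pandas as pd

idx = pd.period_range(start='2020-01', end='2020-05', freq='M')
months_in_window = 2

# zip stops at the shorter sequence, so every pair is a full window:
# the first month of the window and the last month of the window
pairs = list(zip(idx, idx[months_in_window - 1:]))

for start, end in pairs:
    # start_time / end_time give the timestamps used to slice the DataFrame
    print(start, '->', end, '|', start.start_time, 'to', end.end_time)
```

Five months with a two-month window give four pairs, which matches the four frames printed above.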
