I have the following DataFrame, indexed by datetime:
A
date
2020-01 1
2020-01 2
2020-02 3
2020-02 4
2020-03 5
2020-03 6
2020-04 7
2020-04 8
I want to write a for loop that returns new DataFrames (until the end of the data), so that the result is:
DataFrame 1
A
date
2020-01 1
2020-01 2
2020-02 3
2020-02 4
DataFrame 2
2020-02 3
2020-02 4
2020-03 5
2020-03 6
DataFrame 3
2020-03 5
2020-03 6
2020-04 7
2020-04 8
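The idea is an "add and drop" rolling window. The logic is:
- The first two months give DataFrame 1
- Drop the first month and add a new month to get DataFrame 2
- Continue until the end of the data
I found this helpful, but I don't know how to apply it properly. I have tried it, and when I do the rolling, the values for each month get aggregated. I want to keep the original values!
Also, if I use a simple for loop such as
for i in range(len(df)):
    print(df[i : i+n])
I can get my result based on the length of the DataFrame. But how do I actually do this by month?
Any suggestions would be greatly appreciated, thanks!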
You can try this:
import pandas as pd
d = {'date': ['2020-01','2020-01','2020-02','2020-02','2020-03','2020-03','2020-04','2020-04',], 'A': [1,2,3,4,5,6,7,8]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')
result=[]
for date in df.date.unique():
    result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=1))])
Output:
for x in result:
    print(x)
date A
0 2020-01-01 1
1 2020-01-01 2
2 2020-02-01 3
3 2020-02-01 4
date A
2 2020-02-01 3
3 2020-02-01 4
4 2020-03-01 5
5 2020-03-01 6
date A
4 2020-03-01 5
5 2020-03-01 6
6 2020-04-01 7
7 2020-04-01 8
date A
6 2020-04-01 7
7 2020-04-01 8
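Note that the loop above also produces a final DataFrame holding only 2020-04, because there is no later month to pair it with. If only complete two-month windows are wanted, one possible sketch is to slide over the distinct month labels instead. This assumes the windows are defined over the months actually present in the data (not over calendar months), and the names n_months, month and full_windows are illustrative, not from the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01', '2020-01', '2020-02', '2020-02',
                            '2020-03', '2020-03', '2020-04', '2020-04'],
                           format='%Y-%m'),
    'A': [1, 2, 3, 4, 5, 6, 7, 8],
})

n_months = 2  # assumed window length, in months

# 'YYYY-MM' label of every row; these strings sort chronologically
month = df['date'].dt.strftime('%Y-%m')
months = sorted(month.unique())

full_windows = []
# stop early enough that every window contains exactly n_months labels
for i in range(len(months) - n_months + 1):
    wanted = months[i:i + n_months]
    full_windows.append(df[month.isin(wanted)])

for window in full_windows:
    print(window)
```

With the sample data this should give three DataFrames (Jan-Feb, Feb-Mar, Mar-Apr) and no trailing one-month frame.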
OK, I get what you want now! Try this:
import pandas as pd
import numpy as np
d = {'date': ['2020-01','2020-01','2020-02','2020-02','2020-03','2020-03','2020-04','2020-04',], 'A': [1,2,3,4,5,6,7,8]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')
result=[]
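for i, date in enumerate(df.date.unique()):
    if i == 0:
        # first window: from the first date up to two months later
        result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=2))])
    else:
        try:
            # later windows: start strictly after the last date of the previous window
            last_date = result[i-1].iloc[-1, :]['date']
            result.append(df[(df['date'] > last_date) & (df['date'] <= last_date + pd.DateOffset(months=2))])
        except IndexError:
            # the previous window is empty or missing, so there is nothing left to add
            pass
# drop any empty windows
result = [i for i in result if not i.empty]
for res in result:
    print(res)
Here I added a date condition based on the previous DataFrame: each new window keeps only the dates strictly greater than (>) the last date of the previous DataFrame.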
df
Out[248]:
date A
0 2020-01-01 1
1 2020-01-01 2
2 2020-02-01 3
3 2020-02-01 4
4 2020-03-01 5
5 2020-03-01 6
6 2020-04-01 7
7 2020-04-01 8
for res in result:
    print(res)
date A
0 2020-01-01 1
1 2020-01-01 2
2 2020-02-01 3
3 2020-02-01 4
4 2020-03-01 5
5 2020-03-01 6
date A
6 2020-04-01 7
7 2020-04-01 8
Edit:
result = []
for i, date in enumerate(df.date.unique()):
    if i == 0:
        # choose your time period here (14 months for the first df, as in your example)
        result.append(df[(df['date'] >= date) & (df['date'] <= pd.to_datetime(date) + pd.DateOffset(months=14))])
    else:
        try:
            # for the other dfs, take all rows between the first date of the previous
            # DataFrame + 3 months and the first date of the previous DataFrame + 14 + 3 months
            first_date = result[i-1].iloc[0, :]['date']
            result.append(df[(df['date'] > first_date + pd.DateOffset(months=3)) & (df['date'] <= first_date + pd.DateOffset(months=17))])
        except IndexError:
            # the previous window is empty or missing
            pass
result = [i for i in result if not i.empty]
for res in result:
    print(res)
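The edit above hard-codes the window length (14 months) and the step (3 months) in the filter expressions. As a rough sketch only, both can be made parameters by working with monthly periods. The helper month_windows and its arguments are hypothetical names introduced here, and the boundary handling (whole calendar months, inclusive on both ends) is an assumption that differs slightly from the date arithmetic in the edit:

```python
import pandas as pd

def month_windows(df, date_col, window, step):
    """Yield slices of df spanning `window` calendar months, moving `step` months each time."""
    periods = df[date_col].dt.to_period('M')   # monthly period of every row
    start = df[date_col].min().to_period('M')  # first month in the data
    last = df[date_col].max().to_period('M')   # last month in the data
    while start <= last:
        end = start + window - 1               # last month included in this window
        chunk = df[(periods >= start) & (periods <= end)]
        if not chunk.empty:
            yield chunk
        start = start + step

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01', '2020-01', '2020-02', '2020-02',
                            '2020-03', '2020-03', '2020-04', '2020-04'],
                           format='%Y-%m'),
    'A': [1, 2, 3, 4, 5, 6, 7, 8],
})

overlapping = list(month_windows(df, 'date', window=2, step=1))
disjoint = list(month_windows(df, 'date', window=2, step=2))
```

Here step=1 gives the overlapping "add and drop" windows (including a shorter final one), while step equal to window gives back-to-back windows with no overlap.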
IIUC, let's build a dictionary of DataFrames using rolling index values, like this:
import pandas as pd
import numpy as np
from io import StringIO
# Create and read in dummy dataframe
txtfile = StringIO(""" A
date
2020-01 1
2020-01 2
2020-02 3
2020-02 4
2020-03 5
2020-03 6
2020-04 7
2020-04 8
2020-05 9
""")
df = pd.read_csv(txtfile, index_col=0, sep=r'\s+', engine='python')
# Use `pd.factorize` to create the position of each of the periods
pos, ldates = pd.factorize(df.index)
# Create pd.Series from a unique list of those positions
s = pd.Series(np.arange(pos.max()+1))
# Use this rolling-apply trick to create a sliding window of indexes positions
l_of_idx = []
s.rolling(2).apply(lambda x: l_of_idx.append(x.astype(int).tolist()) or 0)
# Note: the 2 in rolling indicates two months in this example
# Create a dictionary of dataframes for each of those windows
d_dfs = {n:df.loc[ldates[i]] for n,i in enumerate(l_of_idx)}
#Output each dataframe:
print('First dataframe:')
print(d_dfs[0])
print('\nSecond dataframe')
print(d_dfs[1])
print('\nThird dataframe')
print(d_dfs[2])
print('\nFourth dataframe')
print(d_dfs[3])
Output of the dictionary values:
First dataframe:
A
date
2020-01 1
2020-01 2
2020-02 3
2020-02 4
Second dataframe
A
date
2020-02 3
2020-02 4
2020-03 5
2020-03 6
Third dataframe
A
date
2020-03 5
2020-03 6
2020-04 7
2020-04 8
Fourth dataframe
A
date
2020-04 7
2020-04 8
2020-05 9
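To make the factorize step above a little more concrete, here is a small sketch of what pd.factorize returns for a month index with repeats, and of the label windows that the rolling trick ends up selecting. The toy index and the names labels and windows are made up for illustration, and plain slicing is swapped in for the rolling-apply trick:

```python
import pandas as pd

idx = pd.Index(['2020-01', '2020-01', '2020-02', '2020-02', '2020-03'])

# pos gives each row's position among the unique labels;
# labels holds the unique labels in order of first appearance
pos, labels = pd.factorize(idx)

# sliding windows of two consecutive labels, built with plain slicing
windows = [labels[i:i + 2].tolist() for i in range(len(labels) - 1)]

print(pos)
print(list(labels))
print(windows)
```

Each entry of windows can then be passed to df.loc to pull every row carrying those labels, which is what the dictionary comprehension above does.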
Here is another approach. I used @ScottBoston's definition of df, then converted the index to datetime:
df.index = pd.to_datetime(df.index)
Then I created a PeriodIndex covering the whole date range, zipped the PeriodIndex with an offset copy of itself (to get each window's start and end dates), and stored the results in a list.
# create the sorted range of monthly periods spanned by the data set
idx = pd.period_range(start=df.index.min(), end=df.index.max(), freq='M')
months_in_window = 2
d_dfs = list()
# extract each window from the original data frame
for start, end in zip(idx, idx[months_in_window - 1:]):
    d_dfs.append(df[start.start_time : end.end_time])
descs = ['First dataframe:', 'Second dataframe:',
         'Third dataframe:', 'Fourth dataframe:']
for desc, d_df in zip(descs, d_dfs):
    print(desc)
    print(d_df, end='\n\n')
First dataframe:
A
date
2020-01-01 1
2020-01-01 2
2020-02-01 3
2020-02-01 4
Second dataframe:
A
date
2020-02-01 3
2020-02-01 4
2020-03-01 5
2020-03-01 6
Third dataframe:
A
date
2020-03-01 5
2020-03-01 6
2020-04-01 7
2020-04-01 8
Fourth dataframe:
A
date
2020-04-01 7
2020-04-01 8
2020-05-01 9
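In case the zip step is not obvious, this short sketch shows only the pairing: zipping a PeriodIndex with a shifted copy of itself produces one (start, end) pair per window, and zip stops as soon as the shorter sequence runs out. The date range is hard-coded here for illustration, and pairs is an illustrative name:

```python
import pandas as pd

idx = pd.period_range(start='2020-01', end='2020-05', freq='M')
months_in_window = 2

# pair every period with the period that closes its window
pairs = [(str(start), str(end))
         for start, end in zip(idx, idx[months_in_window - 1:])]
print(pairs)

# the timestamps used for slicing: first instant of the start month
# and last instant of the end month
first_start = idx[0].start_time
first_end = idx[1].end_time
print(first_start, first_end)
```

Because start_time and end_time mark the first and last instants of their months, the slice includes every row dated within the window, not just rows stamped on the first day of a month.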