我正试图通过ID推导出在特定状态下花费的平均持续时间的平均值。为此,我首先根据ID和日期对我的数据帧进行排序,并使用apply和shift函数尝试推断行[I+1]-行[I]-给定行[I+1]-行[I]的日期。
我得到以下异常:AttributeError:"int"对象没有属性"shift">
下面是模拟代码:
import datetime
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)
有人能建议如何用lambda函数求解最后一行吗?
我无法用lambda完成它。你可以这样尝试:
import datetime
today = datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)
输出:
#mask
0 False
4 False
2 False
3 True
1 False
Name: id, dtype: bool
#frame_ordered
id status date duration
0 1245 1 2022-07-01 25.0
4 1248 6 2022-01-03 204.0
2 2345 4 2022-04-20 97.0
3 4556 5 2022-02-02 38.0
1 4556 2 2022-03-12 136.0
我认为这些值没有被解释为pandas时间戳。有了正确的转换,应该很容易:
import datetime
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)
它打印出
id status date shifted Difference
0 1245 1 2022-07-01 NaT NaT
4 1248 6 2022-01-03 2022-07-01 -179 days
2 2345 4 2022-04-20 2022-01-03 107 days
3 4556 5 2022-02-02 2022-04-20 -77 days
1 4556 2 2022-03-12 2022-02-02 38 days