Comparing DataFrame rows | shift and apply function throw an exception



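I am trying to derive, per ID, the average duration spent in a particular status. To do this, I first sort my dataframe by ID and date, then use apply and shift to compute the date of row[i+1] minus the date of row[i] when both rows share the same ID.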

I get the following exception: AttributeError: 'int' object has no attribute 'shift'

Here is the mock code:

import pandas as pd
from datetime import datetime
today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)

Can anyone suggest how to fix the last line using the lambda function?

I couldn't get it done with a lambda. You could try it this way:

import datetime
import numpy as np
import pandas as pd
today =  datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary 
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)

Output:

#mask
0    False
4    False
2    False
3     True
1    False
Name: id, dtype: bool
#frame_ordered
     id  status       date  duration
0  1245       1 2022-07-01      25.0
4  1248       6 2022-01-03     204.0
2  2345       4 2022-04-20      97.0
3  4556       5 2022-02-02      38.0
1  4556       2 2022-03-12     136.0
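As an alternative sketch, the helper column and the mask can be avoided by shifting within each id group, so the last row of every id gets NaT and can then be filled with today's date. The fixed `today` below is an assumption, picked so the numbers are reproducible and line up with the output above; in practice something like `pd.Timestamp.today()` would take its place.

```python
import pandas as pd

# Fixed reference date (an assumption, chosen for reproducibility)
today = pd.Timestamp("2022-07-26")

frame = pd.DataFrame({
    "id": [1245, 4556, 2345, 4556, 1248],
    "status": [1, 2, 4, 5, 6],
    "date": pd.to_datetime(
        ["2022-07-01", "2022-03-12", "2022-04-20", "2022-02-02", "2022-01-03"]
    ),
})
frame_ordered = frame.sort_values(["id", "date"])

# Next date within the same id; NaT on the last row of each id
next_date = frame_ordered.groupby("id")["date"].shift(-1)

# Where an id has no later row, measure up to `today` instead
frame_ordered["duration"] = (next_date.fillna(today) - frame_ordered["date"]).dt.days
print(frame_ordered)
```

Because the missing values are filled before `.dt.days` is taken, the duration column stays an integer column rather than turning into floats.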

I think the values are not being interpreted as pandas timestamps. With the correct conversion, it should be easy:

import pandas as pd
from datetime import datetime
today = datetime.today()  # keep a real datetime, not a string, so it can be subtracted from dates
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)

It prints:

     id  status       date    shifted Difference
0  1245       1 2022-07-01        NaT        NaT
4  1248       6 2022-01-03 2022-07-01  -179 days
2  2345       4 2022-04-20 2022-01-03   107 days
3  4556       5 2022-02-02 2022-04-20   -77 days
1  4556       2 2022-03-12 2022-02-02    38 days
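Note that the negative values above come from subtracting dates that belong to different ids, since a plain shift ignores the id boundary. A minimal sketch of one way to restrict the difference to rows of the same id, assuming that is the intent, is to shift inside a groupby:

```python
import pandas as pd

frame = pd.DataFrame({
    "id": [1245, 4556, 2345, 4556, 1248],
    "status": [1, 2, 4, 5, 6],
    "date": ["2022-07-01", "2022-03-12", "2022-04-20", "2022-02-02", "2022-01-03"],
})
frame["date"] = pd.to_datetime(frame["date"])
frame_ordered = frame.sort_values(["id", "date"])

# Previous date within the same id; the first row of each id has none (NaT)
previous_date = frame_ordered.groupby("id")["date"].shift(1)
frame_ordered["Difference"] = frame_ordered["date"] - previous_date
print(frame_ordered)
```

With this data only id 4556 has two rows, so only its second row gets a non-missing difference.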
