Original question: Group by min and fill NaNs from another column
I have this dataframe:
import numpy as np
import pandas as pd

mydf = pd.DataFrame(data={
    'uid': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'pagename': ['home', 'blah', 'blah', 'home', 'blah', 'blah', 'blah', 'home', 'blah', 'blah'],
    'startpage': [np.nan, np.nan, np.nan, 'home', 'home', 'blah', np.nan, np.nan, np.nan, np.nan],
    'date_time': [0, 1, 2, 5, 9, 1, 1, 2, 3, 4],
    'page_event': [0, 0, 0, 0, 0, 0, 10, 0, 0, 10],
})
I want to get this dataframe:
endingdf = pd.DataFrame(data={
    'uid': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'pagename': ['home', 'blah', 'blah', 'home', 'blah', 'blah', 'blah', 'home', 'blah', 'blah'],
    'startpage': [np.nan, np.nan, np.nan, 'home', 'home', 'blah', np.nan, np.nan, np.nan, np.nan],
    'date_time': [0, 1, 2, 5, 9, 1, 1, 2, 3, 4],
    'page_event': [0, 0, 0, 0, 0, 0, 10, 0, 0, 10],
    'new_start_page': ['home', 'home', 'home', 'home', 'home', 'blah', 'home', 'home', 'home', 'home'],
})
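What I want to do is group by uid and, if startpage is null, use the first pagename visited (the one with the minimum date_time), but only where page_event = 0. So if the first pagename has page_event = 10, skip ahead until a row with page_event = 0.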
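e = mydf.page_event
p = mydf.pagename
u = mydf.uid
# Blank out page_event == 10 rows, then get each uid's first remaining row label.
# This relies on the rows already being ordered by date_time within each uid.
m = e.mask(e == 10).groupby(u).apply(pd.Series.first_valid_index)
# Map uid -> row label -> pagename, and fill only the missing startpage values.
# Assign the result back: an inplace fillna on a column may not update mydf under copy-on-write.
mydf['startpage'] = mydf.startpage.fillna(u.map(m).map(p))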
print(mydf)
   date_time  page_event  pagename  startpage  uid
0          0           0      home       home    1
1          1           0      blah       home    1
2          2           0      blah       home    1
3          5           0      home       home    2
4          9           0      blah       home    2
5          1           0      blah       blah    3
6          1          10      blah       home    4
7          2           0      home       home    4
8          3           0      blah       home    4
9          4          10      blah       home    4
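
The approach above fills startpage itself and depends on the rows already being in date_time order within each uid. As an alternative, here is a minimal sketch that sorts by date_time explicitly and writes the result to the new_start_page column from the question, leaving startpage untouched. It assumes date_time values are unique within each uid; the name first_page is only illustrative.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'uid': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'pagename': ['home', 'blah', 'blah', 'home', 'blah', 'blah', 'blah', 'home', 'blah', 'blah'],
    'startpage': [np.nan, np.nan, np.nan, 'home', 'home', 'blah', np.nan, np.nan, np.nan, np.nan],
    'date_time': [0, 1, 2, 5, 9, 1, 1, 2, 3, 4],
    'page_event': [0, 0, 0, 0, 0, 0, 10, 0, 0, 10],
})

# Keep only page_event == 0 rows, order them by time,
# and take the earliest pagename for each uid.
# (first_page is a hypothetical helper name, not from the original post.)
first_page = (
    mydf[mydf['page_event'] == 0]
    .sort_values('date_time')
    .groupby('uid')['pagename']
    .first()
)

# Keep the existing startpage where it is present;
# otherwise fall back to the per-uid lookup.
mydf['new_start_page'] = mydf['startpage'].fillna(mydf['uid'].map(first_page))

print(mydf)
```

A uid whose rows all have page_event = 10 does not appear in first_page, so its missing startpage values stay NaN in new_start_page.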