> I have data that looks like this:
subject_id hour_measure urine color heart_rate
3 1 red 40
3 1.15 red 60
4 2 yellow 50
I want to reindex the data so that every patient has 24 hourly measurements. I use the following code:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
df = df.groupby(['subject_id','hour_measure']).mean().reindex(mux).reset_index()
df.to_csv('totalafterreindex.csv')
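It works for the numeric values, but the categorical column is dropped. How can I enhance this code to use the mean for numeric columns and the most frequent value for categorical ones?
Desired output: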
subject_id hour_measure urine color heart_rate
3 1 red 40
3 2 red 60
3 3 yellow 50
3 4 yellow 50
.. .. ..
The idea is to use GroupBy.agg with mean for numeric columns and mode for categorical ones; next with iter is added so that None is returned when mode returns an empty result:
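import numpy as np
import pandas as pd
# Every (subject_id, hour) pair; np.arange(1, 24) yields hours 1..23, use np.arange(1, 25) to include hour 24
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
# Mean for numeric columns, first mode (or None if there is none) for everything else
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id','hour_measure']).agg(f).reindex(mux).reset_index()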
Details:
print (df.groupby(['subject_id','hour_measure']).agg(f))
urine color heart_rate
subject_id hour_measure
3 1.00 red 40
1.15 red 60
4 2.00 yellow 50
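To see the mean/mode split on a group that actually contains several rows, here is a minimal sketch. The sample values, the column name `urine_color` (with an underscore) and the helper name `mean_or_mode` are assumptions made for illustration, not part of the original data. It uses `pd.api.types.is_numeric_dtype` instead of `np.issubdtype`, which should also cope with pandas-specific dtypes such as `category`.

```python
import pandas as pd


def mean_or_mode(s: pd.Series):
    """Mean for numeric columns, most frequent value otherwise (None if no mode)."""
    if pd.api.types.is_numeric_dtype(s.dtype):
        return s.mean()
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else None


# Hypothetical sample: subject 3 has three readings within hour 1.
sample = pd.DataFrame({
    "subject_id": [3, 3, 3, 4],
    "hour_measure": [1, 1, 1, 2],
    "urine_color": ["red", "red", "yellow", "yellow"],
    "heart_rate": [40, 60, 50, 50],
})

agg = sample.groupby(["subject_id", "hour_measure"]).agg(mean_or_mode)
print(agg)
```

For subject 3, hour 1, the heart rate is averaged over the three readings and the color that occurs most often is kept.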
Finally, if necessary, forward fill the missing values per subject_id with GroupBy.ffill:
cols = df1.columns.difference(['subject_id','hour_measure'])
df1[cols] = df1.groupby('subject_id')[cols].ffill()
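Putting the steps together, the following is a rough end-to-end sketch rather than a definitive implementation. It assumes integer hours only (a value such as 1.15 is not in the new index and would be dropped by `reindex`), uses `np.arange(1, 25)` so that hour 24 is included, and relies on hypothetical sample data, a hypothetical `urine_color` column name and a hypothetical helper `mean_or_mode`.

```python
import numpy as np
import pandas as pd


def mean_or_mode(s: pd.Series):
    """Mean for numeric columns, most frequent value otherwise (None if no mode)."""
    if pd.api.types.is_numeric_dtype(s.dtype):
        return s.mean()
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else None


# Hypothetical input with integer hours only.
raw = pd.DataFrame({
    "subject_id": [3, 3, 4],
    "hour_measure": [1, 3, 2],
    "urine_color": ["red", "yellow", "yellow"],
    "heart_rate": [40, 60, 50],
})

# One row per (subject, hour) for hours 1..24.
mux = pd.MultiIndex.from_product(
    [raw["subject_id"].unique(), np.arange(1, 25)],
    names=["subject_id", "hour_measure"],
)

filled = (
    raw.groupby(["subject_id", "hour_measure"])
    .agg(mean_or_mode)
    .reindex(mux)
    .reset_index()
)

# Forward fill inside each subject so values never leak across patients.
cols = filled.columns.difference(["subject_id", "hour_measure"]).tolist()
filled[cols] = filled.groupby("subject_id")[cols].ffill()

print(filled.head(6))
```

Hours that precede a subject's first measurement stay empty, because a forward fill has nothing earlier to copy from; a `bfill` per subject could be added if those rows should be filled too.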