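I want to calculate the change in each person's mental health score between two timepoints.
Each user has a name and mental health scores from 3 different timepoints. I want to calculate the change in mental health score between timepoints 3 and 1.
Here is an example of the df I am starting with:
User Timepoint Mental Health Score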
Bill 1 5
Bill 2 10
Bill 3 15
Wiz 1 10
Wiz 2 10
Wiz 3 15
Sam 1 5
Sam 2 5
Sam 3 5
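For anyone who wants to reproduce this, the sample table above could be rebuilt roughly as follows. This is only a sketch; the integer dtypes and the default 0..8 index are assumptions, since the question does not show how df was created.

```python
import pandas as pd

# Rebuild the sample table from the question (dtypes and index are assumed).
df = pd.DataFrame({
    "User": ["Bill"] * 3 + ["Wiz"] * 3 + ["Sam"] * 3,
    "Timepoint": [1, 2, 3] * 3,
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})
print(df)
```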
Here is the desired output:
User Timepoint Mental Health Score Change in Mental Health (TP1 and 3)
Bill 1 5
Bill 2 10
Bill 3 15 10
Wiz 1 10
Wiz 2 10
Wiz 3 15 5
Sam 1 5
Sam 2 5
Sam 3 5 0
Does anyone know how to do this?
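You can use shift() and np.where(). This assumes the rows are sorted by User and Timepoint and that every user has exactly three timepoints. Rows other than timepoint 3 get 0 instead of being left blank.
import numpy as np
df['Change in Mental Health (TP1 and 3)'] = df['Mental Health Score'] - df['Mental Health Score'].shift(2)
df['Change in Mental Health (TP1 and 3)'] = np.where(df['Timepoint'] != 3, 0, df['Change in Mental Health (TP1 and 3)']).astype(int)
df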
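A plain shift(2) looks two rows back regardless of which user those rows belong to. One possible variant, sketched below, takes the difference inside each User group so values never cross from one user to the next. It still assumes each user has exactly timepoints 1, 2 and 3; the sample df is rebuilt here only to make the snippet self-contained.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "User": ["Bill"] * 3 + ["Wiz"] * 3 + ["Sam"] * 3,
    "Timepoint": [1, 2, 3] * 3,
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

col = "Change in Mental Health (TP1 and 3)"

# Order rows within each user so "two rows back" means timepoint 1.
df = df.sort_values(["User", "Timepoint"])

# diff(2) per group: each score minus the score two rows earlier for the same user.
change = df.groupby("User")["Mental Health Score"].diff(2)

# Keep the value only on timepoint 3 rows; other rows stay NaN.
df[col] = change.where(df["Timepoint"].eq(3))
print(df)
```

Unlike the np.where() version, the rows for timepoints 1 and 2 end up as NaN rather than 0, which is closer to the blank cells in the desired output.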
Try groupby and where:
#sort by Timepoint if needed
#df = df.sort_values("Timepoint")
changes = df.groupby("User")["Mental Health Score"].transform('last')-df.groupby("User")["Mental Health Score"].transform('first')
df["Change"] = changes.where(df["Timepoint"].eq(3))
>>> df
User Timepoint Mental Health Score Change
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
As noted in the comments, you can groupby your DataFrame on User and compute the difference on Mental Health Score.
I have put a code snippet here to demonstrate:
def _overall_change(scores):
    # last score minus first score within one user's group
    return scores.iloc[-1] - scores.iloc[0]
person = df.groupby('User')['Mental Health Score'].agg(_overall_change)
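The aggregation above yields one value per user rather than a new column on df. One way to place it on the timepoint 3 rows, as the question asks, might look like the sketch below. The column name "Change" and the sort by Timepoint are my assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "User": ["Bill"] * 3 + ["Wiz"] * 3 + ["Sam"] * 3,
    "Timepoint": [1, 2, 3] * 3,
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

def _overall_change(scores):
    # Last score minus first score within one user's group.
    return scores.iloc[-1] - scores.iloc[0]

# One value per user, indexed by User; sorting makes first/last mean TP1/TP3.
person = (
    df.sort_values("Timepoint")
      .groupby("User")["Mental Health Score"]
      .agg(_overall_change)
)

# Look up each row's user in the per-user result, then blank out non-TP3 rows.
df["Change"] = df["User"].map(person).where(df["Timepoint"].eq(3))
print(df)
```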
Use groupby and merge:
g = df.sort_values(by='Timepoint').groupby('User')['Mental Health Score']
s = pd.concat({3: g.last()-g.first()})
# User
# 3 Bill 10
# Sam 0
# Wiz 5
# Name: Mental Health Score, dtype: int64
df.merge(s, left_on=['Timepoint', 'User'], right_index=True, how='left')
Output:
User Timepoint Mental Health Score_x Mental Health Score_y
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
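The _x and _y suffixes appear because the merged Series carries the same name as an existing column. A possible variant, sketched here, names the result before merging and joins on ordinary columns instead of a MultiIndex. The name "Change" and the assign(Timepoint=3) step are my own choices for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "User": ["Bill"] * 3 + ["Wiz"] * 3 + ["Sam"] * 3,
    "Timepoint": [1, 2, 3] * 3,
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

g = df.sort_values(by="Timepoint").groupby("User")["Mental Health Score"]

# Per-user change as a small table: User, Change, and the timepoint it belongs to.
s = (g.last() - g.first()).rename("Change").reset_index().assign(Timepoint=3)

# Left merge on both keys: only timepoint 3 rows find a match, the rest get NaN.
out = df.merge(s, on=["User", "Timepoint"], how="left")
print(out)
```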
Here is another possible solution:
import pandas as pd
def calculate_change(mhs):
    mhs = list(mhs)
    return mhs[-1] - mhs[0]
df = df.sort_values(["User", "Timepoint"])
diff = df.groupby('User')['Mental Health Score'].agg(calculate_change)
df = pd.merge(df, diff, how='left', left_on='User', right_index=True)
df.columns = ['User', 'Timepoint', 'Mental Health Score', 'Change']
df['Change'] = df['Change'].loc[df['Timepoint']==3]
print(df)
Output:
User Timepoint Mental Health Score Change
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
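The answers above rely on the first and last rows of each user being timepoints 1 and 3. If that is not guaranteed, one alternative is to select the two timepoints explicitly by pivoting to one row per user. This is a sketch under the assumption that each (User, Timepoint) pair occurs at most once; the names wide and change are only illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "User": ["Bill"] * 3 + ["Wiz"] * 3 + ["Sam"] * 3,
    "Timepoint": [1, 2, 3] * 3,
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

# One row per user, one column per timepoint (columns are the integers 1, 2, 3).
wide = df.pivot(index="User", columns="Timepoint", values="Mental Health Score")

# Explicitly subtract timepoint 1 from timepoint 3 for every user.
change = wide[3] - wide[1]

# Map the per-user change back to the long table, keeping it only on TP3 rows.
df["Change"] = df["User"].map(change).where(df["Timepoint"].eq(3))
print(df)
```

Because the columns are picked by label, this does not depend on row order, and pivot raises an error if a user has duplicate timepoints instead of silently using the wrong row.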