Calculate the change in a score and add it to a new column in pandas



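I want to calculate the change in each person's mental health score between two timepoints.

Each user has a name and mental health scores from 3 different timepoints. I want to calculate the change in mental health score between timepoints 3 and 1.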

Here is an example of the df I am starting with:

User   Timepoint   Mental Health Score
Bill       1               5
Bill       2               10
Bill       3               15
Wiz        1               10
Wiz        2               10
Wiz        3               15
Sam        1               5
Sam        2               5
Sam        3               5

This is the desired output:

User   Timepoint   Mental Health Score      Change in Mental Health (TP1 and 3) 
Bill       1               5                               
Bill       2               10
Bill       3               15                              10
Wiz        1               10
Wiz        2               10
Wiz        3               15                              5
Sam        1               5
Sam        2               5
Sam        3               5                               0

Does anyone know how to do this?

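You can do this with shift() and np.where(), provided the rows are sorted by User and Timepoint and every user has exactly three timepoints.

import numpy as np
# Score minus the score two rows earlier (TP3 - TP1 on each user's third row)
df['Change in Mental Health (TP1 and 3)'] = df['Mental Health Score'] - df['Mental Health Score'].shift(2)
# Rows other than Timepoint 3 are set to 0 here rather than left blank
df['Change in Mental Health (TP1 and 3)'] = np.where(df['Timepoint'] != 3, 0, df['Change in Mental Health (TP1 and 3)']).astype(int)
df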
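
The plain shift(2) above looks two rows back regardless of which user those rows belong to, so it depends on the row order. A minimal sketch of a variant that shifts within each user and leaves the other rows as NaN, as in the desired output, could look like this. It still assumes every user has exactly timepoints 1, 2 and 3; the name `ordered` is illustrative, and the sample frame is rebuilt only to keep the snippet self-contained.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": ["Bill", "Bill", "Bill", "Wiz", "Wiz", "Wiz", "Sam", "Sam", "Sam"],
    "Timepoint": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})
col = "Change in Mental Health (TP1 and 3)"

# Sort a copy so that each user's rows run TP1, TP2, TP3
ordered = df.sort_values(["User", "Timepoint"])

# Shift inside each user's group, so scores never leak between users
previous = ordered.groupby("User")["Mental Health Score"].shift(2)
diff = ordered["Mental Health Score"] - previous

# Keep the difference only on Timepoint 3 rows; other rows become NaN.
# Assignment aligns on the index, so df keeps its original row order.
df[col] = diff.where(ordered["Timepoint"].eq(3))
print(df)
```

Because the first two rows of every group have nothing to shift from, they are NaN before the where() call is even applied.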

Try groupby with where:

#sort by Timepoint if needed
#df = df.sort_values("Timepoint")
changes = df.groupby("User")["Mental Health Score"].transform('last')-df.groupby("User")["Mental Health Score"].transform('first')
df["Change"] = changes.where(df["Timepoint"].eq(3))
>>> df
   User  Timepoint  Mental Health Score  Change
0  Bill          1                    5     NaN
1  Bill          2                   10     NaN
2  Bill          3                   15    10.0
3   Wiz          1                   10     NaN
4   Wiz          2                   10     NaN
5   Wiz          3                   15     5.0
6   Sam          1                    5     NaN
7   Sam          2                    5     NaN
8   Sam          3                    5     0.0
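
Note that 'first' and 'last' pick whichever rows come first and last for each user, so they match timepoints 1 and 3 only when those rows exist and the frame is sorted. As a sketch of an alternative that selects the timepoints by label, one could pivot to a wide table first. The names `wide` and `change` are illustrative, and the sample frame is rebuilt only to keep the snippet self-contained.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": ["Bill", "Bill", "Bill", "Wiz", "Wiz", "Wiz", "Sam", "Sam", "Sam"],
    "Timepoint": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

# One row per user, one column per timepoint (columns are labelled 1, 2, 3)
wide = df.pivot(index="User", columns="Timepoint", values="Mental Health Score")

# Explicit TP3 - TP1, independent of row order in df
change = wide[3] - wide[1]

# Look up each row's user, then keep the value only on Timepoint 3 rows
df["Change"] = df["User"].map(change).where(df["Timepoint"].eq(3))
print(df)
```

If a user had no row for timepoint 1 or 3, the corresponding cell in the wide table would be NaN and so would that user's change, instead of a difference between the wrong pair of rows.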

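As mentioned in the comments, you can group your dataframe by User and compute the difference in Mental Health Score.

Here is a code snippet to demonstrate:

def _overall_change(scores):
    # Last score minus first score within one user's group
    return scores.iloc[-1] - scores.iloc[0]
# Column name matches the example df in the question
person = df.groupby('User')['Mental Health Score'].agg(_overall_change)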
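
The aggregation above yields one value per user, not a column on the original frame. A rough sketch of how it might be attached to the Timepoint 3 rows, assuming the column is simply called `Change` (a name chosen here for illustration) and rebuilding the sample frame to stay self-contained:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": ["Bill", "Bill", "Bill", "Wiz", "Wiz", "Wiz", "Sam", "Sam", "Sam"],
    "Timepoint": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Mental Health Score": [5, 10, 15, 10, 10, 15, 5, 5, 5],
})

def _overall_change(scores):
    # Last score minus first score within one user's group
    return scores.iloc[-1] - scores.iloc[0]

# Sort by Timepoint first so "first" and "last" mean TP1 and TP3
person = (
    df.sort_values("Timepoint")
      .groupby("User")["Mental Health Score"]
      .agg(_overall_change)
)

# Map the per-user result back onto the rows, keeping it only at Timepoint 3
df["Change"] = df["User"].map(person).where(df["Timepoint"].eq(3))
print(df)
```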

Use groupby with merge:

g = df.sort_values(by='Timepoint').groupby('User')['Mental Health Score']
s = pd.concat({3: g.last()-g.first()})
#    User
# 3  Bill    10
#    Sam      0
#    Wiz      5
# Name: Mental Health Score, dtype: int64
df.merge(s, left_on=['Timepoint', 'User'], right_index=True, how='left')

Output:

   User  Timepoint  Mental Health Score_x  Mental Health Score_y
0  Bill          1                      5                    NaN
1  Bill          2                     10                    NaN
2  Bill          3                     15                   10.0
3   Wiz          1                     10                    NaN
4   Wiz          2                     10                    NaN
5   Wiz          3                     15                    5.0
6   Sam          1                      5                    NaN
7   Sam          2                      5                    NaN
8   Sam          3                      5                    0.0

Here is another possible solution:

import pandas as pd
def calculate_change(mhs):
    # Last score minus first score for one user
    mhs = list(mhs)
    return mhs[-1] - mhs[0]
df = df.sort_values(["User", "Timepoint"])
diff = df.groupby('User')['Mental Health Score'].agg(calculate_change)
df = pd.merge(df, diff, how='left', left_on='User', right_index=True)
df.columns = ['User', 'Timepoint', 'Mental Health Score', 'Change']
df['Change'] = df['Change'].loc[df['Timepoint']==3]
print(df)

Output:
   User  Timepoint  Mental Health Score  Change
0  Bill          1                    5     NaN
1  Bill          2                   10     NaN
2  Bill          3                   15    10.0
3   Wiz          1                   10     NaN
4   Wiz          2                   10     NaN
5   Wiz          3                   15     5.0
6   Sam          1                    5     NaN
7   Sam          2                    5     NaN
8   Sam          3                    5     0.0
