Pandas - Split and refactor an overloaded ID column

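I have a pandas DataFrame with the columns patient_id, patient_sex, and patient_dob (plus other, less relevant columns). Rows can share a patient_id, because each patient may have several entries across multiple medical procedures. However, I have found that many patient_ids are overloaded, i.e. more than one patient has been assigned the same id. The evidence is that there are many cases where a single patient_id is associated with multiple sexes and multiple dates of birth.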
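One way to confirm which ids are overloaded is to count the distinct (sex, date of birth) pairs under each patient_id. The following is only a minimal sketch: the toy data is made up for illustration, and only the three column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
patients = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "patient_sex": ["F", "F", "M", "M", "M", "F"],
    "patient_dob": ["1980-01-01", "1980-01-01", "1975-05-05",
                    "1990-02-02", "1990-02-02", "1960-03-03"],
})

# Count the distinct (sex, dob) pairs seen under each patient_id.
pairs_per_id = (
    patients[["patient_id", "patient_sex", "patient_dob"]]
    .drop_duplicates()
    .groupby("patient_id")
    .size()
)

# An id is overloaded when it maps to more than one (sex, dob) pair.
overloaded_ids = pairs_per_id[pairs_per_id > 1].index.tolist()
print(overloaded_ids)
```

In this toy frame only id 1 is shared by two different (sex, dob) pairs, so it is the only id reported.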

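To refactor the ids so that each patient has a unique one, my plan is to group the data not only by patient_id but also by patient_sex and patient_dob. I think this should be enough to split the data into individual patients (and if two patients with the same sex and date of birth happen to have been assigned the same id, then so be it).

Here is the code I am currently using: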

key_cols = ['patient_id', 'patient_sex', 'patient_dob']
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(key_cols).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge only the new id column into the original patients frame, so the merge
# does not create suffixed duplicates of the other columns
patients_with_new_ids = patients.merge(indv_patients[[new_patient_id]], left_on=key_cols, right_index=True)
# Remove the original id column and put the new ids in its place
patients_with_new_ids = patients_with_new_ids.drop(columns=['patient_id'])
patients = patients_with_new_ids.rename(columns={new_patient_id: 'patient_id'})

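The problem is that with more than 7 million patients this solution is far too slow, and the biggest bottleneck is the for loop. So my question is: is there a better way to fix these overloaded ids? (The actual id values don't matter, as long as each patient gets a unique one.)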

I don't know what the values in your columns look like, but have you tried something like this?

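# Cast each value to str and add a separator, so this also works for non-string columns
patients['new_patient_id'] = patients.apply(lambda x: str(x['patient_id']) + '_' + str(x['patient_sex']) + '_' + str(x['patient_dob']), axis=1)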

This should create a new column, which you can then use with groupby as new_patient_id.
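Since the actual id values don't matter, another option that avoids per-row Python calls is to let pandas number the groups itself with `groupby(...).ngroup()`. This is a rough sketch rather than a tested solution for the 7-million-row frame: the toy data and the `procedure` column are assumptions for illustration, and `dropna=False` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data; "procedure" stands in for the other columns.
patients = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "patient_sex": ["F", "F", "M", "M", "M", "F"],
    "patient_dob": ["1980-01-01", "1980-01-01", "1975-05-05",
                    "1990-02-02", "1990-02-02", "1960-03-03"],
    "procedure": ["a", "b", "c", "d", "e", "f"],
})

key_cols = ["patient_id", "patient_sex", "patient_dob"]

# ngroup() assigns one integer per distinct key combination, with no Python loop.
# dropna=False keeps rows with a missing key value in a group of their own
# instead of labelling them -1.
patients["new_patient_id"] = patients.groupby(key_cols, dropna=False).ngroup()

# Replace the overloaded id column with the new one.
patients = (
    patients.drop(columns="patient_id")
    .rename(columns={"new_patient_id": "patient_id"})
)
print(patients)
```

Rows that share all three key values get the same integer, so the repeated entries for one patient stay linked while patients that used to share an id are separated.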
