我正在处理一个化学数据集,我想知道做下面事情的最聪明的方法。我的数据集看起来像这样:
formula Temperature (Kelvin) (Physical) Property Value
CO2 298 5
CO2 298 7.6
CO2 300 3.2
NaCl 300 3.4
NaCl 296 1.4
H2O 298 7.2
H2O 298 8.3
H2O 293 6.4
ZnO 300 3.10
ZnO 290 1.2
FeO 295 4.6
FeO 290 3.6
给定室温:= 298K,
我想要完成的是过滤原始数据集,以便只有在室温可用时报告的值,如果在室温下没有报告值,我想保持最接近值为可用的室温。根据我想要实现的目标,上面的样本初始数据集将变成类似于 的内容formula Temperature (Kelvin) (Physical) Property Value
CO2 298 5
CO2 298 7.6
NaCl 300 3.4
H2O 298 7.2
H2O 298 8.3
ZnO 300 3.10
FeO 295 4.6
也许我应该使用lambda
表达式?
对于如何实现这样的东西有什么建议吗?
许多谢谢,詹姆斯我们可以先过滤" good ";一种是温度为298k的。然后我们可以对剩下的行按照它们的距离排序到298k然后删除重复的,只保留最接近的。我们最后合并了好的和这些:
# room temp in K
rt = 298
# taking those that have `rt` K temp
good_ones = df[df["Temperature (Kelvin)"].eq(rt)]
good_names = good_ones.formula.unique()
# getting others
others = df[~df.formula.isin(good_names)]
# filtering others according to distance to `rt`
sorter = lambda s: s.sub(rt).abs()
others_filtered = (others
.sort_values("Temperature (Kelvin)", key=sorter)
.drop_duplicates("formula", keep="first"))
# merging them all
result = pd.concat([good_ones, others_filtered]).sort_index(ignore_index=True)
,
>>> result
formula Temperature (Kelvin) (Physical) Property Value
0 CO2 298 5.0
1 CO2 298 7.6
2 NaCl 300 3.4
3 H2O 298 7.2
4 H2O 298 8.3
5 ZnO 300 3.1
6 FeO 295 4.6
还有apply
方式:
def filter_temp(gr):
# get he temp column and a bool series where it equals `rt`
temps = gr["Temperature (Kelvin)"]
rt_temps = temps.eq(rt)
# does any temp match `rt`?
if rt_temps.any():
# then return the locations it matches
return gr[rt_temps]
else:
# otherwise return the closest one
return gr.loc[[gr.temps.sub(rt).abs().idxmin()]]
result = (df.groupby("formula", as_index=False, group_keys=False)
.apply(filter_temp).sort_index(ignore_index=True))
这里的想法是通过formula
分组行。然后对每个组进行筛选,以保留具有所需室温的所有行(如果有的话)或具有最接近温度的唯一行。让我们定义这个函数:
def temperature_filter(df, room_temp, temp_col="Temperature (Kelvin)"):
if room_temp in df[temp_col].values:
return df[df[temp_col] == room_temp]
else:
return df.loc[[abs(df[temp_col] - room_temp).idxmin()]]
只需将此函数应用于每个组:
ROOM_TEMP = 298
df.groupby("formula", sort=False).apply(temperature_filter, ROOM_TEMP).droplevel("formula")
请注意,temperature_filter
已经写得很清楚,bun也可以作为lambda函数包含,以达到一行解决方案!
def filter_closest_to_rt(df, rt=298):
df['tmrt'] = df['Temperature (Kelvin)'].sub(rt).abs()
return df[df['tmrt'] == df.groupby('formula')['tmrt'].transform('min')].drop(columns='tmrt')
filter_closest_to_rt(df)
formula Temperature (Kelvin) (Physical) Property Value
0 CO2 298 5.0
1 CO2 298 7.6
3 NaCl 300 3.4
4 NaCl 296 1.4
5 H2O 298 7.2
6 H2O 298 8.3
8 ZnO 300 3.1
10 FeO 295 4.6
这个怎么样:
df['val'] = np.abs(df['Temperature (Kelvin)'] - 298)
df = df.sort_values(['formula', 'val'], ascending=[True, True])
df = df.drop_duplicates(subset='formula', keep="first")
为了确保您不会丢失任何298个重复项,另一个解决方案是:
df['val'] = np.abs(df['Temperature (Kelvin)'] - 298)
the_298s = df[df['Temperature (Kelvin)'] == 298]
others = df[df['Temperature (Kelvin)'] != 298]
others = others.sort_values(['formula', 'val'], ascending=[True, True])
others = others.drop_duplicates(subset='formula', keep="first")
the_298s_formulas = the_298s.formula.unique()
others = others[~ others.formula.isin(the_298s_formulas)]
final_df = the_298s.append(others)