根据列中的信息过滤出化学数据集



我正在处理一个化学数据集,我想知道做下面事情的最聪明的方法。我的数据集看起来像这样:

formula      Temperature (Kelvin)      (Physical) Property Value
  CO2             298                       5
  CO2             298                       7.6
  CO2             300                       3.2
  NaCl            300                       3.4
  NaCl            296                       1.4
  H2O             298                       7.2
  H2O             298                       8.3
  H2O             293                       6.4
  ZnO             300                       3.10
  ZnO             290                       1.2
  FeO             295                       4.6
  FeO             290                       3.6

给定室温:= 298K,

我想要完成的是过滤原始数据集,以便只有在室温可用时报告的值,如果在室温下没有报告值,我想保持最接近值为可用的室温。根据我想要实现的目标,上面的样本初始数据集将变成类似于 的内容
formula      Temperature (Kelvin)      (Physical) Property Value
  CO2             298                       5
  CO2             298                       7.6
  NaCl            300                       3.4
  H2O             298                       7.2
  H2O             298                       8.3
  ZnO             300                       3.10
  FeO             295                       4.6

也许我应该使用lambda表达式?

对于如何实现这样的东西有什么建议吗?

许多谢谢,

詹姆斯

我们可以先过滤" good ";一种是温度为298k的。然后我们可以对剩下的行按照它们的距离排序到298k然后删除重复的,只保留最接近的。我们最后合并了好的和这些:

# room temp in K
rt = 298
# taking those that have `rt` K temp
good_ones = df[df["Temperature (Kelvin)"].eq(rt)]
good_names = good_ones.formula.unique()
# getting others
others = df[~df.formula.isin(good_names)]
# filtering others according to distance to `rt`
sorter = lambda s: s.sub(rt).abs()
others_filtered = (others
                     .sort_values("Temperature (Kelvin)", key=sorter)
                     .drop_duplicates("formula", keep="first"))
# merging them all
result = pd.concat([good_ones, others_filtered]).sort_index(ignore_index=True)

,

>>> result
  formula  Temperature (Kelvin)  (Physical) Property Value
0     CO2                   298                        5.0
1     CO2                   298                        7.6
2    NaCl                   300                        3.4
3     H2O                   298                        7.2
4     H2O                   298                        8.3
5     ZnO                   300                        3.1
6     FeO                   295                        4.6

还有apply方式:

def filter_temp(gr):
    # get he temp column and a bool series where it equals `rt`
    temps = gr["Temperature (Kelvin)"]
    rt_temps = temps.eq(rt)
    # does any temp match `rt`?
    if rt_temps.any():
        # then return the locations it matches
        return gr[rt_temps]
    else:
        # otherwise return the closest one
        return gr.loc[[gr.temps.sub(rt).abs().idxmin()]]
result = (df.groupby("formula", as_index=False, group_keys=False)
            .apply(filter_temp).sort_index(ignore_index=True))

这里的想法是通过formula分组行。然后对每个组进行筛选,以保留具有所需室温的所有行(如果有的话)或具有最接近温度的唯一行。让我们定义这个函数:

def temperature_filter(df, room_temp, temp_col="Temperature (Kelvin)"):
   if room_temp in df[temp_col].values:
       return df[df[temp_col] == room_temp]
   else:
       return df.loc[[abs(df[temp_col] - room_temp).idxmin()]]

只需将此函数应用于每个组:

ROOM_TEMP = 298
df.groupby("formula", sort=False).apply(temperature_filter, ROOM_TEMP).droplevel("formula")

请注意,temperature_filter已经写得很清楚,bun也可以作为lambda函数包含,以达到一行解决方案!

def filter_closest_to_rt(df, rt=298):
    df['tmrt'] = df['Temperature (Kelvin)'].sub(rt).abs()
    return df[df['tmrt'] == df.groupby('formula')['tmrt'].transform('min')].drop(columns='tmrt')
filter_closest_to_rt(df)
   formula  Temperature (Kelvin)  (Physical) Property Value
0      CO2                   298                        5.0
1      CO2                   298                        7.6
3     NaCl                   300                        3.4
4     NaCl                   296                        1.4
5      H2O                   298                        7.2
6      H2O                   298                        8.3
8      ZnO                   300                        3.1
10     FeO                   295                        4.6

这个怎么样:

df['val'] = np.abs(df['Temperature (Kelvin)'] - 298)
df = df.sort_values(['formula', 'val'], ascending=[True, True])
df = df.drop_duplicates(subset='formula', keep="first")

为了确保您不会丢失任何298个重复项,另一个解决方案是:

df['val'] = np.abs(df['Temperature (Kelvin)'] - 298)
the_298s = df[df['Temperature (Kelvin)'] == 298]
others = df[df['Temperature (Kelvin)'] != 298]
others = others.sort_values(['formula', 'val'], ascending=[True, True])
others = others.drop_duplicates(subset='formula', keep="first")
the_298s_formulas = the_298s.formula.unique()
others = others[~ others.formula.isin(the_298s_formulas)]
final_df = the_298s.append(others)

最新更新