Pandas - idxmin在多个列上保持所有关系



我有一个DF,看起来像这样:

Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         inf       inf
1      NC_000871|1  NC_017595|1        1.0         inf       inf
2      NC_000872|1  NC_017595|1        1.0         inf       inf
3      NC_000896|1  NC_008530|1        1.0         inf       inf
4      NC_000902|1  NC_011353|1        1.0         inf       inf
...            ...          ...        ...         ...       ...
51935  NC_024392|1  NC_021824|1        inf         inf       1.0
51936  NC_024392|1  NC_021829|1        inf         inf       1.0
51937  NC_024392|1  NC_021837|1        inf         inf       1.0
51938  NC_024392|1  NC_021872|1        inf         inf       1.0
51939  NC_024392|1  NC_022737|1        inf         inf       1.0

我想要的是将这个df按Virus分组,并且对于每一组,在每一列中取等于min的行(第一行是列blastRank为min的行,第二行是列crisprRank为min的行,等等)。如果有多个最小值,那么我想保留所有列。我还必须以一种方式来做,它将支持不止这3列(我的程序必须支持超过3个数字列,这就是为什么我使用df[df.columns.to_list()[2:]]

这是我的代码和它产生的df:

df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))

Virus         Host  blastRank  crisprRank  mashRank
0   0  NC_000866|1  NC_017660|1        1.0         inf       inf
1  NC_000866|1  NC_017660|1        1.0         inf       inf
2  NC_000866|1  NC_002163|1        inf         inf       1.0
1   0  NC_000871|1  NC_017595|1        1.0         inf       inf
1  NC_000871|1  NC_006449|1        inf         1.0       1.0
...            ...          ...        ...         ...       ...
818 1  NC_024391|1  NC_009641|1        1.0         inf       inf
2  NC_024391|1  NC_003103|1        inf         inf       1.0
819 0  NC_024392|1  NC_021823|1        1.0         1.0       inf
1  NC_024392|1  NC_021823|1        1.0         1.0       inf
2  NC_024392|1  NC_003212|1        inf         inf       1.0

可以看到,idxmin()只返回第一个最小值。我想做一些类似idxmin(keep='all')的事情来获得所有的领带。

我认为您需要为所有关系测试每组的最小值:

cols = df.columns.to_list()[2:]
f = lambda x: x.apply(lambda x: x[x == x.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)

如果需要所有值按原顺序排列:

cols = df.columns.to_list()[2:]
f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby(['Virus'], as_index=False).apply(f)
df = df.dropna(subset=cols, how='all')

或:

df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(['Virus','Host'],'variable','value')
print (df1)

有一种方法可以解决这个问题:

import numpy as np
import pandas as pd
from io import StringIO
data = StringIO("""
Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         5       8
1      NC_000866|1  NC_017595|1        2.0         4       5
2      NC_000872|1  NC_017595|1        3.0         3       10
3      NC_000872|1  NC_008530|1        4.0         0       3
4      NC_000872|1  NC_011353|1        5.0         1       -3
""")
df = pd.read_csv(data, sep='s+').convert_dtypes()
cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]
def get_all_min(sdf):    
sdf_min = sdf.min().to_frame().T
result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner') for c in sdf_min.columns if c in cols_of_interest])
result = result.drop_duplicates().reset_index(drop=True)
return result
df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)

相关内容

  • 没有找到相关文章

最新更新