我有一个DF,看起来像这样:
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000871|1 NC_017595|1 1.0 inf inf
2 NC_000872|1 NC_017595|1 1.0 inf inf
3 NC_000896|1 NC_008530|1 1.0 inf inf
4 NC_000902|1 NC_011353|1 1.0 inf inf
... ... ... ... ... ...
51935 NC_024392|1 NC_021824|1 inf inf 1.0
51936 NC_024392|1 NC_021829|1 inf inf 1.0
51937 NC_024392|1 NC_021837|1 inf inf 1.0
51938 NC_024392|1 NC_021872|1 inf inf 1.0
51939 NC_024392|1 NC_022737|1 inf inf 1.0
我想要的是将这个df按Virus
分组,并且对于每一组,在每一列中取等于min的行(第一行是列blastRank为min的行,第二行是列crisprRank为min的行,等等)。如果有多个最小值,那么我想保留所有列。我还必须以一种方式来做,它将支持不止这3列(我的程序必须支持超过3个数字列,这就是为什么我使用df[df.columns.to_list()[2:]]
这是我的代码和它产生的df:
df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))
Virus Host blastRank crisprRank mashRank
0 0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000866|1 NC_017660|1 1.0 inf inf
2 NC_000866|1 NC_002163|1 inf inf 1.0
1 0 NC_000871|1 NC_017595|1 1.0 inf inf
1 NC_000871|1 NC_006449|1 inf 1.0 1.0
... ... ... ... ... ...
818 1 NC_024391|1 NC_009641|1 1.0 inf inf
2 NC_024391|1 NC_003103|1 inf inf 1.0
819 0 NC_024392|1 NC_021823|1 1.0 1.0 inf
1 NC_024392|1 NC_021823|1 1.0 1.0 inf
2 NC_024392|1 NC_003212|1 inf inf 1.0
可以看到,idxmin()
只返回第一个最小值。我想做一些类似idxmin(keep='all')
的事情来获得所有的领带。
我认为您需要为所有关系测试每组的最小值:
cols = df.columns.to_list()[2:]
f = lambda x: x.apply(lambda x: x[x == x.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)
如果需要所有值按原顺序排列:
cols = df.columns.to_list()[2:]
f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby(['Virus'], as_index=False).apply(f)
df = df.dropna(subset=cols, how='all')
或:
df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(['Virus','Host'],'variable','value')
print (df1)
有一种方法可以解决这个问题:
import numpy as np
import pandas as pd
from io import StringIO
data = StringIO("""
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 5 8
1 NC_000866|1 NC_017595|1 2.0 4 5
2 NC_000872|1 NC_017595|1 3.0 3 10
3 NC_000872|1 NC_008530|1 4.0 0 3
4 NC_000872|1 NC_011353|1 5.0 1 -3
""")
df = pd.read_csv(data, sep='s+').convert_dtypes()
cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]
def get_all_min(sdf):
sdf_min = sdf.min().to_frame().T
result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner') for c in sdf_min.columns if c in cols_of_interest])
result = result.drop_duplicates().reset_index(drop=True)
return result
df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)