Get the column names of the non-NaN values for each row of a DataFrame (Python)



I have a DataFrame with several features, and any feature can contain NaN values. For example:

feature1    feature2    feature3   feature4
10           NaN          5          2
2            1            3          1
NaN          2            4          NaN

Note: the columns can also contain strings.
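For reproducibility, the example frame above could be built roughly as follows. This is only a sketch: the variable names `df` and `mask` are my own choice, and the values are copied from the table.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; np.nan marks the missing cells.
df = pd.DataFrame(
    {
        "feature1": [10, 2, np.nan],
        "feature2": [np.nan, 1, 2],
        "feature3": [5, 3, 4],
        "feature4": [2, 1, np.nan],
    }
)

# Boolean frame of the same shape: True where a value is present.
mask = df.notna()
print(mask)
```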

How can we get a list/array that contains, for each row, the names of the columns holding non-NaN values?

So the result for my example would be:
res = [['feature1', 'feature3', 'feature4'],
       ['feature1', 'feature2', 'feature3', 'feature4'],
       ['feature2', 'feature3']]

For better performance, convert the values to NumPy arrays and use a list comprehension:

c = df.columns.to_numpy()
res = [c[x].tolist() for x in df.notna().to_numpy()]
print (res)
[['feature1', 'feature3', 'feature4'], 
['feature1', 'feature2', 'feature3', 'feature4'], 
['feature2', 'feature3']]
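The same idea can be wrapped in a small helper and checked against a frame that also holds strings, as the question's note requires. This is a sketch under my own assumptions: the function name `non_nan_columns` and the sample frame `mixed` are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def non_nan_columns(df):
    """Return, for each row, the list of column names whose value is not NaN."""
    cols = df.columns.to_numpy()   # column labels as a NumPy array
    mask = df.notna().to_numpy()   # 2-D boolean array, True where a value is present
    # Indexing the label array with one boolean row keeps only the present columns.
    return [cols[row].tolist() for row in mask]


# Hypothetical frame mixing numeric and string columns.
mixed = pd.DataFrame(
    {
        "feature1": [10, np.nan],
        "label": ["a", np.nan],
        "feature3": [np.nan, 4.0],
    }
)
result = non_nan_columns(mixed)
print(result)
```

Because `notna` works on object columns as well, string values count as present and only the missing cells are filtered out.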

Timings with the frame enlarged to 3,000 rows:

df = pd.concat([df] * 1000, ignore_index=True)

In [28]: %%timeit
...: out = (df.stack().reset_index().groupby('level_0')['level_1']
...:          .agg(list).to_numpy().tolist()
...:        )
...:
96.5 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [29]: %%timeit
...: c = df.columns.to_numpy()
...: res = [c[x].tolist() for x in df.notna().to_numpy()]
...: 
3.36 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
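Outside IPython, the comparison could be reproduced with the standard-library `timeit` module. The sketch below makes a few assumptions: the function names `with_stack` and `with_mask` are hypothetical, the repeat count is arbitrary, and the explicit `.dropna()` is my addition so that the result does not depend on how the installed pandas version's `stack` handles NaN by default. Absolute timings will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "feature1": [10, 2, np.nan],
        "feature2": [np.nan, 1, 2],
        "feature3": [5, 3, 4],
        "feature4": [2, 1, np.nan],
    }
)
df = pd.concat([small] * 1000, ignore_index=True)  # 3,000 rows


def with_stack(frame):
    # stack + groupby approach; dropna() is added here for version robustness.
    return (
        frame.stack().dropna().reset_index()
        .groupby("level_0")["level_1"].agg(list).to_numpy().tolist()
    )


def with_mask(frame):
    # Boolean-mask approach using NumPy arrays and a list comprehension.
    c = frame.columns.to_numpy()
    return [c[x].tolist() for x in frame.notna().to_numpy()]


slow = with_stack(df)
fast = with_mask(df)

# Average seconds per call over a handful of runs.
n = 5
print("stack/groupby:", timeit.timeit(lambda: with_stack(df), number=n) / n)
print("boolean mask: ", timeit.timeit(lambda: with_mask(df), number=n) / n)
```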

You can use stack, which by default keeps only the non-NaN values, and aggregate the column names into lists with groupby.agg:

out = df.stack().reset_index().groupby('level_0')['level_1'].agg(list)

Output as a Series:

level_0
0              [feature1, feature3, feature4]
1    [feature1, feature2, feature3, feature4]
2                        [feature2, feature3]
Name: level_1, dtype: object

As a list:

out = (df.stack().reset_index().groupby('level_0')['level_1']
.agg(list).to_numpy().tolist()
)

Output:

[['feature1', 'feature3', 'feature4'],
['feature1', 'feature2', 'feature3', 'feature4'],
['feature2', 'feature3']]
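One caveat with the stack approach that seems worth noting: a row consisting entirely of NaN leaves no entries after stacking, so it disappears from the grouped result, whereas the boolean-mask approach would return an empty list for it. If every row must be represented, the result could be reindexed against the original index. This is a sketch; the extra all-NaN row, the variable names, and the explicit `.dropna()` (added so the behaviour does not depend on the pandas version's `stack` defaults) are my own assumptions.

```python
import numpy as np
import pandas as pd

# Same sample data plus a hypothetical fourth row that is entirely NaN.
df = pd.DataFrame(
    {
        "feature1": [10, 2, np.nan, np.nan],
        "feature2": [np.nan, 1, 2, np.nan],
        "feature3": [5, 3, 4, np.nan],
        "feature4": [2, 1, np.nan, np.nan],
    }
)

out = (
    df.stack().dropna().reset_index()
    .groupby("level_0")["level_1"].agg(list)
)
print(len(out))  # the all-NaN row has no group

# Reindex to restore the missing rows, then replace the gaps with empty lists.
full = [v if isinstance(v, list) else [] for v in out.reindex(df.index)]
print(full)
```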
