我有一个熊猫数据帧,看起来大致像
foo foo2 foo3 foo4
a NY WA AZ NaN
b DC NaN NaN NaN
c MA CA NaN NaN
我想制作一个嵌套的观察结果列表,但省略 NaN 值,所以我有类似 [['NY'、'WA'、'AZ']、['DC']、['MA',CA'] 之类的东西。
此数据帧中存在一种模式,如果这有所不同,则如果 fooX 为空,则后续列 fooY 也将为空。
我最初在下面有类似这样的代码。我相信有更好的方法可以做到这一点
A = [[i] for i in subset_label['label'].tolist()]
B = [i for i in subset_label['label2'].tolist()]
C = [i for i in subset_label['label3'].tolist()]
D = [i for i in subset_label['label4'].tolist()]
out_list = []
for index, row in subset_label.iterrows():
out_list.append([row.label, row.label2, row.label3, row.label4])
out_list
选项 1
pd.DataFrame.stack
默认删除 na。
df.stack().groupby(level=0).apply(list).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
___
选项 2
有趣的替代方案,因为我认为对熊猫对象中的列表求和很有趣。
df.applymap(lambda x: [x] if pd.notnull(x) else []).sum(1).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
选项 3numpy
实验
nn = df.notnull().values
sliced = df.values.ravel()[nn.ravel()]
splits = nn.sum(1)[:-1].cumsum()
[s.tolist() for s in np.split(sliced, splits)]
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
试试这个:
In [77]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[77]: [['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
这是一个矢量化版本!
original = pd.DataFrame(data={
'foo': ['NY', 'DC', 'MA'],
'foo2': ['WA', np.nan, 'CA'],
'foo3': ['AZ', np.nan, np.nan],
'foo4': [np.nan] * 3,
})
out = original.copy().fillna('NAN')
# Build up mapping such that each non-nan entry is mapped to [entry]
# and nan entries are mapped to []
unique_entries = np.unique(out.values)
mapping = {e: [e] for e in unique_entries}
mapping['NAN'] = []
# Apply mapping
for c in original.columns:
out[c] = out[c].map(mapping)
# Concatenate the lists along axis 1
out.sum(axis=1)
你应该得到类似的东西
0 [NY, WA, AZ]
1 [DC]
2 [MA, CA]
dtype: object