I have a pandas DataFrame like this:
>>> df1
  col_1 col_2                 labels
1   aaa   abc                [71020]
2   bbb   cde                [77085]
3   ccc   efg                [36415]
4   ddd   ghi         [99213, 99287]
5   eee   ijk  [99233, 71020, 36415]
and another DataFrame like this (I got it by applying sklearn.preprocessing.MultiLabelBinarizer to the labels column of df1):
>>> df2
   71020  77085  36415  99213  99287  99233
1      1      0      0      0      0      0
2      0      1      0      0      0      0
3      0      0      1      0      0      0
4      0      0      0      1      1      0
5      1      0      1      0      0      1
I would now like to merge (or right join) df2 onto df1, with the keys being the values in the labels column and the column names of df2, like so:
  col_1 col_2                 labels  71020  77085  36415  99213  99287  99233
1   aaa   abc                [71020]      1      0      0      0      0      0
2   bbb   cde                [77085]      0      1      0      0      0      0
3   ccc   efg                [36415]      0      0      1      0      0      0
4   ddd   ghi         [99213, 99287]      0      0      0      1      1      0
5   eee   ijk  [99233, 71020, 36415]      1      0      1      0      0      1
How can I do this?
You can get the desired output (probably more easily) if you build df2 with pandas instead, as follows.
First, create df2a from the labels column of df1:
import ast

# This assumes the `labels` column holds real lists rather than strings that look like lists.
# Any value that is still a string is parsed into a real list first.
df1['labels'] = df1['labels'].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)

# Then build the indicator (dummy) table with pandas
df2a = df1['labels'].explode().astype(str).str.get_dummies().groupby(level=0).max()

# Optionally convert the column labels from strings back to integers
df2a.columns = df2a.columns.astype(int)
Result:
print(df2a)
   36415  71020  77085  99213  99233  99287
0      0      1      0      0      0      0
1      0      0      1      0      0      0
2      1      0      0      0      0      0
3      0      0      0      1      0      1
4      1      1      0      0      1      0
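If the one-liner that builds df2a looks opaque, it may help to run the chain one step at a time. The sketch below does that on the sample labels retyped from the question; the intermediate names `exploded` and `dummies` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample labels retyped from the question; only this column matters here.
labels = pd.Series(
    [[71020], [77085], [36415], [99213, 99287], [99233, 71020, 36415]],
    name="labels",
)

# 1. explode(): one row per list element; the original index value is repeated.
exploded = labels.explode()

# 2. astype(str) + str.get_dummies(): one indicator column per distinct label,
#    still one row per exploded element.
dummies = exploded.astype(str).str.get_dummies()

# 3. groupby(level=0).max(): collapse the repeated index values back into one
#    row per original row, keeping a 1 wherever any element set it.
df2a = dummies.groupby(level=0).max()

print(exploded.index.tolist())  # repeated index values after explode
print(dummies.shape)            # one row per exploded element
print(df2a.shape)               # one row per original row
```

The groupby step is what makes the final join line up: without it, df2a would have several rows for each multi-label row of df1.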
Then we can join it to df1 to get the desired output:
df1.join(df2a)
Result:
  col_1 col_2                 labels  36415  71020  77085  99213  99233  99287
0   aaa   abc                [71020]      0      1      0      0      0      0
1   bbb   cde                [77085]      0      0      1      0      0      0
2   ccc   efg                [36415]      1      0      0      0      0      0
3   ddd   ghi         [99213, 99287]      0      0      0      1      0      1
4   eee   ijk  [99233, 71020, 36415]      1      1      0      0      1      0
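If you would rather keep the df2 you already got from MultiLabelBinarizer, a plain index join is probably enough, provided df2 shares df1's index as it does in the question's printout. Here is a minimal sketch under that assumption: df2 is typed in by hand from the question rather than recomputed, and the name `merged` is just illustrative.

```python
import pandas as pd

# df1 retyped from the question, using the 1-based index shown there.
df1 = pd.DataFrame(
    {
        "col_1": ["aaa", "bbb", "ccc", "ddd", "eee"],
        "col_2": ["abc", "cde", "efg", "ghi", "ijk"],
        "labels": [[71020], [77085], [36415], [99213, 99287], [99233, 71020, 36415]],
    },
    index=[1, 2, 3, 4, 5],
)

# df2 typed in by hand from the question's printout (not recomputed here);
# it stands in for the MultiLabelBinarizer result and shares df1's index.
df2 = pd.DataFrame(
    [
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 1],
    ],
    columns=[71020, 77085, 36415, 99213, 99287, 99233],
    index=df1.index,
)

# Rows are matched on the index, so no key column is needed.
merged = df1.join(df2)
print(merged)
```

One thing to watch: fit_transform returns a plain array, so if you wrap it in a DataFrame without passing index=df1.index, it gets a default 0-based index and will not line up with a df1 whose index starts at 1.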