Get a sparse df from a column of lists with different lengths and make sure it merges correctly

I have a pandas DataFrame like this:

>>> df1
  col_1 col_2                 labels
1   aaa   abc                [71020]
2   bbb   cde                [77085]
3   ccc   efg                [36415]
4   ddd   ghi         [99213, 99287]
5   eee   ijk  [99233, 71020, 36415]

And another DataFrame like this (I got this df by applying sklearn.preprocessing.MultiLabelBinarizer to the labels column of df1):

>>> df2
   71020  77085  36415  99213  99287  99233
1      1      0      0      0      0      0
2      0      1      0      0      0      0
3      0      0      1      0      0      0
4      0      0      0      1      1      0
5      1      0      1      0      0      1

Now I want to merge (or right join) df2 onto df1, where the keys are the values in the labels column and the column names of df2, like this:


  col_1 col_2                 labels  71020  77085  36415  99213  99287  99233
1   aaa   abc                [71020]      1      0      0      0      0      0
2   bbb   cde                [77085]      0      1      0      0      0      0
3   ccc   efg                [36415]      0      0      1      0      0      0
4   ddd   ghi         [99213, 99287]      0      0      0      1      1      0
5   eee   ijk  [99233, 71020, 36415]      1      0      1      0      0      1

How can I do this?

If you build df2 with pandas instead, you can get the desired output (probably more easily) as follows:

Create df2a from the labels column of df1 like this:

import ast
# This assumes the `labels` column holds real lists rather than strings that look like lists.
# If it holds strings, convert them to real lists first
# (ast.literal_eval raises an error on values that are already lists).
if isinstance(df1['labels'].iloc[0], str):
    df1['labels'] = df1['labels'].apply(ast.literal_eval)
# Then build the dummy table with pandas
df2a = df1['labels'].explode().astype(str).str.get_dummies().groupby(level=0).max()
# Optionally convert the column labels from strings back to integers
df2a.columns = df2a.columns.astype(int)

Result:

print(df2a)
   36415  71020  77085  99213  99233  99287
0      0      1      0      0      0      0
1      0      0      1      0      0      0
2      1      0      0      0      0      0
3      0      0      0      1      0      1
4      1      1      0      0      1      0
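
To see why the one-liner works, it may help to run it one step at a time. The sketch below rebuilds the sample df1 from the question (assuming the labels are real lists of integers and the index is the default 0..4) and names each intermediate result. The variable names exploded, dummies and df2a_steps are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (default 0..4 index assumed).
df1 = pd.DataFrame({
    "col_1": ["aaa", "bbb", "ccc", "ddd", "eee"],
    "col_2": ["abc", "cde", "efg", "ghi", "ijk"],
    "labels": [[71020], [77085], [36415], [99213, 99287], [99233, 71020, 36415]],
})

# 1. One row per label; rows with several labels repeat their index value.
exploded = df1["labels"].explode()

# 2. One indicator column per distinct label (str.get_dummies works on strings).
dummies = exploded.astype(str).str.get_dummies()

# 3. Collapse the repeated index values back into one row per original row.
df2a_steps = dummies.groupby(level=0).max()

# 4. Optionally turn the string column labels back into integers.
df2a_steps.columns = df2a_steps.columns.astype(int)

print(df2a_steps)
```

Because the grouping is done on the index (level=0), the result keeps df1's row labels, which is what lets the join in the next step line up.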

Then we can join it to df1 to get the desired output:

df1.join(df2a)

Result:

col_1 col_2                 labels  36415  71020  77085  99213  99233  99287
0   aaa   abc                [71020]      0      1      0      0      0      0
1   bbb   cde                [77085]      0      0      1      0      0      0
2   ccc   efg                [36415]      1      0      0      0      0      0
3   ddd   ghi         [99213, 99287]      0      0      0      1      0      1
4   eee   ijk  [99233, 71020, 36415]      1      1      0      0      1      0
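
DataFrame.join aligns rows on the index, so the same approach should also work for a df2 that already exists, provided it carries df1's index. Here is a minimal sketch, assuming df1 has the 1..5 index shown in the question. The array `binarized` and the list `classes` are typed by hand as stand-ins for what MultiLabelBinarizer's fit_transform and classes_ would give you; they are not produced by scikit-learn here.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the 1..5 index it shows.
df1 = pd.DataFrame(
    {
        "col_1": ["aaa", "bbb", "ccc", "ddd", "eee"],
        "col_2": ["abc", "cde", "efg", "ghi", "ijk"],
        "labels": [[71020], [77085], [36415], [99213, 99287], [99233, 71020, 36415]],
    },
    index=[1, 2, 3, 4, 5],
)

# Hand-typed stand-ins (assumption) for the binarizer's output and class list.
classes = [36415, 71020, 77085, 99213, 99233, 99287]
binarized = np.array([
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
])

# Reuse df1's index so that join pairs each row with its own indicators.
df2 = pd.DataFrame(binarized, columns=classes, index=df1.index)
merged = df1.join(df2)

print(merged)
```

If df2 were built without index=df1.index, it would get a default 0..4 index. The join would then pair rows 1 to 4 with the wrong indicators and leave row 5 filled with NaN.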
