小贝子编程

What is the best method to format data that is a list of categories into an adjacency matrix?

Keywords: python-3.x scikit-learn data-processing
Updated: 2023-09-09

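I have data that I plan to feed into an sklearn model. Some of the columns are lists of categories (it's movie data, so for example one column is {genres: [comedy, horror]}).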

How should I process these columns so that what goes into the model is an adjacency matrix, where each row then has data like the following?

{comedy: 1, action: 0, horror: 1, documentary: 0}

The preprocessor you're looking for is LabelBinarizer:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

data = [{'genres': ['comedy', 'horror']}, {'genres': ['action', 'documentary']}]
df = pd.DataFrame(data)
# explode the lists so each genre gets its own row, keeping the original row index
df_exploded = pd.concat([
    pd.DataFrame(v, index=np.repeat(k, len(v)), columns=['genre'])
    for k, v in df.genres.to_dict().items()])
lb = LabelBinarizer()
# make the binary fields, one column per genre
dd = pd.DataFrame(lb.fit_transform(df_exploded['genre']), index=df_exploded.index, columns=lb.classes_)
# collapse back to one row per movie: a genre is 1 if any of the movie's rows has it
dd.groupby(dd.index).max()

This gives:

   action  comedy  documentary  horror
0       0       1            0       1
1       1       0            1       0
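
As an alternative that is not part of the answer above: scikit-learn also ships MultiLabelBinarizer, which takes a list of label lists directly, so the explode and groupby steps may not be needed. A minimal sketch, assuming the same toy data as above (the variable names mlb and encoded are just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [{'genres': ['comedy', 'horror']}, {'genres': ['action', 'documentary']}]
df = pd.DataFrame(data)

# MultiLabelBinarizer expects an iterable of label collections, one per row
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df['genres']),
    index=df.index,
    columns=mlb.classes_,  # classes are sorted alphabetically
)
print(encoded)
```

On this data it should produce the same 2x4 indicator table as shown above. It also avoids a quirk of LabelBinarizer, which returns a single column when only two distinct labels are present.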
