How do I convert an object containing 3 numbers into three separate columns in pandas?



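I ran a sentiment analysis model on my tweet dataset and stored its output in a new column named 'scores'. Each output is a set of 3 probabilities: the first is the probability that the tweet is negative, the second that it is neutral, and the third that it is positive. For example: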

[0.013780469, 0.94494355, 0.041276094]

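Below is a screenshot of some observations from the 'scores' column.

Running df.scores.dtype shows that the data type is object.

I want to create three separate columns, one per probability: 'Negative', 'Neutral', 'Positive'. In other words, I want to split 'scores' apart. How can I do that?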

I have tried this:

df[['Negative', 'Neutral', 'Positive']] = pd.DataFrame(df.scores.tolist(), index=df.index)

But I get this error:

ValueError: Columns must be same length as key

I also tried this:

df[['Negative', 'Neutral', 'Positive']] = pd.DataFrame([ x.split('~') for x in df['scores'].tolist() ])

But I get this error:

AttributeError: 'float' object has no attribute 'split'

When I use str(x).split() instead of x.split(), I get this error:

ValueError: Columns must be same length as key

Here is the output of print(df['scores']):

0       [0.07552529 0.7626313  0.16184345]
1       [0.0552146  0.7753107  0.16947475]
2       [0.06891786 0.6625086  0.26857358]
3       [0.10522033 0.7078265  0.18695314]
4       [0.04945428 0.78878057 0.16176508]
...                
4976    [0.0196455  0.9556966  0.02465796]
4977    [0.02270025 0.94873595 0.02856365]
4978    [0.01378047 0.94494355 0.04127609]
4979    [0.05239033 0.9061995  0.04141007]
4980    [0.0651902  0.9061197  0.02869013]
Name: scores, Length: 4981, dtype: object

Here is the output of df.loc[0:5, "scores"].to_dict():

{0: '[0.07552529 0.7626313  0.16184345]',
1: '[0.0552146  0.7753107  0.16947475]',
2: '[0.06891786 0.6625086  0.26857358]',
3: '[0.10522033 0.7078265  0.18695314]',
4: '[0.04945428 0.78878057 0.16176508]',
5: '[0.02224329 0.87228    0.10547666]'}

You can try this approach:

import pandas as pd

# Create some sample data (each score list is stored as a string)
df = pd.DataFrame(columns=["scores"], data=["[0.013780469, 0.94494355, 0.041276094]",
"[0.013780469, 0.94494355, 0.941276094]",
"[0.513780469, 0.74494355, 0.041276094]",
"[0.813780469, 0.14494355, 0.541276094]"])
# Strip the surrounding brackets, then split on ", " into three columns.
# str.strip is used because a bare "[" is not a valid regular expression.
df[['Negative', 'Neutral', 'Positive']] = df.scores.str.strip("[]").str.split(", ", expand=True)
# Drop the original scores column
df.drop("scores", axis=1, inplace=True)
# Note: the new columns still hold strings, not floats
print(df)
Output:

      Negative     Neutral     Positive
0  0.013780469  0.94494355  0.041276094
1  0.013780469  0.94494355  0.941276094
2  0.513780469  0.74494355  0.041276094
3  0.813780469  0.14494355  0.541276094
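
The sample data above is comma-separated, whereas the to_dict() output in the question shows strings whose numbers are separated by a varying number of spaces. The following is a minimal sketch for that layout, assuming every non-missing cell is a string holding exactly three numbers. The NaN row is an assumption added for illustration: the "'float' object has no attribute 'split'" error suggests that at least one cell may be a float such as NaN rather than a string.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's to_dict() output,
# plus one hypothetical missing value (an assumption, see above).
df = pd.DataFrame({"scores": [
    "[0.07552529 0.7626313  0.16184345]",
    "[0.0552146  0.7753107  0.16947475]",
    np.nan,
]})

# Strip the brackets, split on any run of whitespace, convert to float.
# Missing cells stay NaN in all three new columns.
parts = (
    df["scores"]
    .str.strip("[]")
    .str.split(expand=True)
    .astype(float)
)
parts.columns = ["Negative", "Neutral", "Positive"]

# Attach the three numeric columns next to the original one.
df = df.join(parts)
print(df[["Negative", "Neutral", "Positive"]])
```

Calling str.split with no separator splits on any run of whitespace, so the double spaces in the original strings do not produce empty fields. Because the result is converted with astype(float), the new columns can be used directly in numeric operations such as comparisons or idxmax.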
