Pandas DataFrame: split a column and sum



I have a pandas DataFrame like the one below.

import pandas as pd
import numpy as np
df = pd.DataFrame([['Bay of Plenty', 'Bell Rd, Nukuhou','Nukuhou, Bay of Plenty'],[1.0, 0.5,1.0]]).T
df.columns = ['col1','col2']
                     col1  col2
0           Bay of Plenty     1
1        Bell Rd, Nukuhou   0.5
2  Nukuhou, Bay of Plenty     1

I want to get the following output.

col1           sum
Bay of Plenty  2.0
Nukuhou        1.5
Bell Rd        0.5

I tried the following.

df["splited"]=df["col1"].str.split(",")
df = (df.explode("splited").reset_index(drop=True))
                     col1  col2        splited
0           Bay of Plenty     1  Bay of Plenty
1        Bell Rd, Nukuhou   0.5        Bell Rd
2        Bell Rd, Nukuhou   0.5        Nukuhou
3  Nukuhou, Bay of Plenty     1        Nukuhou
4  Nukuhou, Bay of Plenty     1  Bay of Plenty
df.groupby(['splited']).sum().reset_index() 

But this does not give the expected sums.

You can split on ", " (a comma followed by a space) instead of on "," alone:

#whitespaces
print(df["col1"].str.split(",").tolist())
[['Bay of Plenty'], ['Bell Rd', ' Nukuhou'], ['Nukuhou', ' Bay of Plenty']] 
                                ^^^                      ^^^
#no whitespaces
print(df["col1"].str.split(", ").tolist())
[['Bay of Plenty'], ['Bell Rd', 'Nukuhou'], ['Nukuhou', 'Bay of Plenty']]

df["splited"]=df["col1"].str.split(", ")
df = df.explode("splited")
df = df.groupby('splited')['col2'].sum().reset_index() 
print(df)
         splited  col2
0  Bay of Plenty   2.0
1        Bell Rd   0.5
2        Nukuhou   1.5
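
To see why the attempt in the question does not give the totals, here is a minimal sketch, assuming the same three rows as above (the DataFrame is rebuilt from a dict here only to keep the example self-contained). Splitting on "," alone leaves a leading space in some pieces, so "Nukuhou" and " Nukuhou" end up in different groups:

```python
import pandas as pd

# Same data as in the question, rebuilt so the example runs on its own.
df = pd.DataFrame({
    "col1": ["Bay of Plenty", "Bell Rd, Nukuhou", "Nukuhou, Bay of Plenty"],
    "col2": [1.0, 0.5, 1.0],
})

# Splitting on "," alone keeps the space that follows each comma.
exploded = df.assign(splited=df["col1"].str.split(",")).explode("splited")
totals = exploded.groupby("splited")["col2"].sum()

# repr() makes the leading space in the group keys visible.
for key, value in totals.items():
    print(repr(key), value)
```

With this data the loop prints five keys rather than three, because the pieces with and without a leading space are summed separately.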

Another idea is to use Series.str.strip to remove the surrounding whitespace:

df["splited"]=df["col1"].str.split(",")
df = df.explode("splited")
df = df.groupby(df['splited'].str.strip())['col2'].sum().reset_index() 
print(df)
         splited  col2
0  Bay of Plenty   2.0
1        Bell Rd   0.5
2        Nukuhou   1.5
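
If the result should also match the layout requested in the question (columns named col1 and sum, largest total first), one possible sketch is to rename and sort after the groupby. The variable names exploded and result are only illustrative, and ignore_index in explode assumes pandas 1.1 or later:

```python
import pandas as pd

# Same data as in the question, rebuilt so the example runs on its own.
df = pd.DataFrame({
    "col1": ["Bay of Plenty", "Bell Rd, Nukuhou", "Nukuhou, Bay of Plenty"],
    "col2": [1.0, 0.5, 1.0],
})

# One row per piece; ignore_index gives the exploded frame a fresh 0..n-1 index.
exploded = df.assign(splited=df["col1"].str.split(",")).explode(
    "splited", ignore_index=True
)

result = (
    exploded.groupby(exploded["splited"].str.strip())["col2"]
    .sum()                         # total per stripped piece
    .rename("sum")                 # name the value column "sum"
    .rename_axis("col1")           # name the key column "col1"
    .sort_values(ascending=False)  # largest total first
    .reset_index()
)
print(result)
```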

EDIT:

If you need to split on a comma followed by one space or by no space at all, you can use a regex:

df = pd.DataFrame([['Bay of Plenty', 'Bell Rd, Nukuhou',
'Nukuhou,Bay of Plenty'],[1.0, 0.5,1.0]]).T
df.columns = ['col1','col2']
df["splited"]=df["col1"].str.split(r",\s*")
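
A possible way to carry the regex approach through to the totals is sketched below, using the same explode and groupby steps as in the earlier examples. It relies on str.split treating a pattern longer than one character as a regular expression by default; the variable name out is only illustrative:

```python
import pandas as pd

# Data from the edit: the last row has no space after the comma.
df = pd.DataFrame({
    "col1": ["Bay of Plenty", "Bell Rd, Nukuhou", "Nukuhou,Bay of Plenty"],
    "col2": [1.0, 0.5, 1.0],
})

# r",\s*" matches a comma followed by zero or more whitespace characters,
# so both ", " and "," are handled and no leading space is left behind.
df["splited"] = df["col1"].str.split(r",\s*")

out = df.explode("splited").groupby("splited")["col2"].sum().reset_index()
print(out)
```

Because the whitespace is consumed by the separator itself, no separate strip step is needed here.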
