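When I use pd.read_csv('myfile.csv', delimiter=';') on a CSV file with duplicate column names, pandas renames the duplicates by appending .1, .2, and so on (.#, where # is the number of the repeated column).
My example CSV looks like this: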
text1
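The sample file is cut off above. As a minimal sketch of the renaming behaviour, the snippet below reads a small in-memory CSV. Its content is an assumption, reconstructed from the outputs shown in the answers, and is not the asker's actual file.

```python
import io

import pandas as pd

# Assumed sample data, reconstructed from the answer outputs (not the original file).
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)

df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Repeated header names get a numeric suffix so that every label is unique.
print(list(df.columns))
# ['Data1', 'Data2', 'A', 'B', 'B.1', 'C', 'C.1']
```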
You can rename your columns, then stack, join, and unstack:
df = pd.read_csv('filename.csv', sep=';')
# remove the '.x' in columns
df.columns = df.columns.map(lambda x: x.split('.')[0])
# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
.stack() # columns to rows
.groupby(level=[0,1,2]) # group by all
.apply(','.join) # join duplicates
.unstack() # A/B/C back to columns
)
Output:
                       A            B            C
Data1 Data2
abc   def          text1  text2,text3  text4,text5
asd   fgh          text2  text4,text3  text5,text1
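A variation on the same idea, as a sketch: melt the frame while the labels are still unique, strip the numeric suffix in the long format, then join and unstack. This avoids stacking a frame whose column labels are duplicated, which some pandas versions may reject. The sample data and the names `long`, `col` and `val` are assumptions made for illustration.

```python
import io

import pandas as pd

# Assumed sample data, reconstructed from the output above.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Wide to long while the column labels are still unique (B, B.1, ...).
long = df.melt(id_vars=["Data1", "Data2"], var_name="col", value_name="val")

# Strip the '.N' suffix so duplicates share one name again.
long["col"] = long["col"].str.replace(r"\.\d+$", "", regex=True)

res = (
    long.dropna(subset=["val"])               # ignore missing cells
        .groupby(["Data1", "Data2", "col"])["val"]
        .agg(",".join)                        # join duplicates
        .unstack("col")                       # A/B/C back to columns
)
print(res)
```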
Inspired by an SO thread:
import pandas as pd
df = pd.read_csv(r'./example.csv', delimiter=';')
def sjoin(x):
    # join the non-null values of a row into one string
    return ';'.join(x[x.notnull()].astype(str))
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))
The result is:
       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
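Since pandas 2.1, `groupby(..., axis=1)` is deprecated in favour of grouping the transposed frame. A minimal sketch of that equivalent is below; the sample data is an assumption reconstructed from the result above, and `sjoin` is the helper from the answer.

```python
import io

import pandas as pd

# Assumed sample data, reconstructed from the result above.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")


def sjoin(x):
    # Join the non-null values of a Series into one string.
    return ";".join(x[x.notnull()].astype(str))


# Transpose so the column labels become the index, group the rows by the
# base name, aggregate each original row (now a column), and transpose back.
res = df.T.groupby(lambda col: col.split(".")[0]).agg(sjoin).T
print(res)
```

Group keys are sorted by default, so the columns come out as A, B, C, Data1, Data2, as in the result shown above.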
Here is another solution, which uses a regular expression to group the mangled columns:
# (.*?): capture the original column name at the beginning of the string
# potentially followed by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"
Let's see it in action:
>>> df.columns.str.extract(shared_groupname)
0
0 Data1
1 Data2
2 A
3 B
4 B
5 C
6 C
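The same pattern can be checked outside pandas with the standard `re` module. This is only a sketch, assuming the pattern is meant to be `r"(.*?)(?:\.\d+)?$"` with an escaped dot and `\d`; the helper `base_name` is hypothetical.

```python
import re

shared_groupname = r"(.*?)(?:\.\d+)?$"


def base_name(col):
    # Group 1 is the lazy prefix; the optional '.<digits>' suffix is not captured.
    return re.match(shared_groupname, col).group(1)


names = ["Data1", "Data2", "A", "B", "B.1", "C", "C.1"]
bases = [base_name(n) for n in names]
print(bases)
# ['Data1', 'Data2', 'A', 'B', 'B', 'C', 'C']

# Caveat: a genuine column name ending in '.<digits>' is shortened too.
print(base_name("v1.2"))
# v1
```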
Then group by this original column name and apply the join:
grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)
res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
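One thing to be aware of: `x.dropna()` on a group removes every row that has a missing value in any of the duplicated columns, so such a row is not joined at all. If missing cells should simply be skipped, a per-row join is one option. The sketch below uses a small hypothetical frame with a gap and builds the result column by column instead of using `groupby(axis=1)`.

```python
import numpy as np
import pandas as pd

shared_groupname = r"(.*?)(?:\.\d+)?$"

# Hypothetical frame with one missing cell in a duplicated column.
df = pd.DataFrame(
    {
        "Data1": ["abc", "asd"],
        "B": ["text2", "text4"],
        "B.1": ["text3", np.nan],
    }
)

# Base name for every column, e.g. 'B.1' -> 'B'.
keys = df.columns.str.extract(shared_groupname, expand=False)

# For each base name, select its columns and join the non-missing cells per row.
res = pd.DataFrame(
    {
        name: df.loc[:, keys == name].apply(
            lambda row: ", ".join(row.dropna().astype(str)), axis=1
        )
        for name in dict.fromkeys(keys)  # unique names, original order
    }
)
print(res)
```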