Concatenating columns that were mangled on import

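When I use pd.read_csv('myfile.csv', delimiter=';') on a CSV with duplicate column names, pandas mangles the duplicate columns by appending the suffixes .1, .2, .# (where # is the number of the duplicate column).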
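
The renaming is easy to reproduce without a file on disk. The sketch below feeds a small in-memory CSV to pd.read_csv and lists the resulting column names; the sample data is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data: the header repeats the names B and C.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
)

df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# The second occurrence of each repeated name gets a numeric suffix.
columns = list(df.columns)
print(columns)  # ['Data1', 'Data2', 'A', 'B', 'B.1', 'C', 'C.1']
```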

My example CSV looks like this:

Data1;Data2;A;B;B;C;C
abc;def;text1;text2;text3;text4;text5
asd;fgh;text2;text4;text3;text5;text1

You can rename your columns, then stack, join, and unstack:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')
# remove the '.x' suffix from the column names
df.columns = df.columns.map(lambda x: x.split('.')[0])
# reshaping
(df.set_index(['Data1', 'Data2'])  # set those columns aside
   .stack()                        # columns to rows
   .groupby(level=[0, 1, 2])       # group by all index levels
   .apply(','.join)                # join duplicates
   .unstack()                      # A/B/C back to columns
)

Output:

                 A            B            C
Data1 Data2                                 
abc   def    text1  text2,text3  text4,text5
asd   fgh    text2  text4,text3  text5,text1
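
Stacking a frame whose column labels are duplicated may not work in every pandas version, so here is a hedged variant of the same reshape. It goes to long format with melt first, while the mangled names are still unique, and strips the suffix afterwards. The sample data and the names long_df and wide are assumptions made for illustration, not part of the answer above.

```python
import io
import pandas as pd

# Sample data assumed for illustration, modelled on the output shown above.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Wide to long while the mangled names (B, B.1, ...) are still unique.
long_df = df.melt(id_vars=["Data1", "Data2"], var_name="column", value_name="text")
# Drop missing cells so that str.join only ever sees strings.
long_df = long_df.dropna(subset=["text"])
# Strip the '.x' suffix now that every value sits in its own row.
long_df["column"] = long_df["column"].str.split(".").str[0]

wide = (
    long_df.groupby(["Data1", "Data2", "column"])["text"]
    .agg(",".join)   # join the duplicates of each original column
    .unstack()       # A/B/C back to columns
)
print(wide)
```

Like the original, this splits on the first dot, so a column whose real name contains a dot would be truncated.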

Inspired by an SO thread:

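import pandas as pd

df = pd.read_csv(r'./example.csv', delimiter=';')

def sjoin(x):
    # join the non-null values of one row with ';'
    return ';'.join(x[x.notnull()].astype(str))

# group the columns by their name without the '.x' suffix
# (groupby(..., axis=1) is deprecated as of pandas 2.1)
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))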

The result is:

       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
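
Because grouping along axis=1 is deprecated in recent pandas releases, a possible workaround is to transpose, group the rows, and transpose back. This is a minimal sketch under that assumption, using in-memory sample data modelled on the result above; it is not the code from the SO thread.

```python
import io
import pandas as pd

# Sample data assumed for illustration.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

def sjoin(values):
    """Join the non-null entries of a Series with ';'."""
    return ";".join(values.dropna().astype(str))

# After transposing, the column names are row labels, so an ordinary
# row-wise groupby can merge B with B.1 and C with C.1.
result = df.T.groupby(lambda col: col.split(".")[0]).agg(sjoin).T
print(result)
```

As in the result above, the group keys come back sorted, so Data1 and Data2 end up after A, B, and C.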

Here is another solution, using a regular expression to group the mangled columns:

# (.*?): capture the original column name at the beginning of the string
# potentially followed by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"

Let's see it in action:

>>> df.columns.str.extract(shared_groupname) 
       0
0  Data1
1  Data2
2      A
3      B
4      B
5      C
6      C
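
The pattern can also be checked on plain strings with Python's re module. The sketch below assumes the dot and the digit class are escaped (\. and \d), and the list of names is made up for illustration. The last entry shows where this differs from splitting on the first dot: only a trailing dot-plus-digits suffix is removed.

```python
import re

# Lazy capture of the name, optionally followed by '.<digits>' at the end.
shared_groupname = r"(.*?)(?:\.\d+)?$"

# Hypothetical column names, including one that contains dots of its own.
names = ["Data1", "A", "B", "B.1", "C.12", "v1.0.1"]

stripped = [re.match(shared_groupname, name).group(1) for name in names]
print(stripped)  # ['Data1', 'A', 'B', 'B', 'C', 'v1.0']
```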

Then group by this original column name and apply the join:

grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)
res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
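
For pandas versions where groupby no longer accepts axis=1, the same idea can be sketched end to end by grouping the transposed frame with the extracted names as keys. The sample data and the variable names keys and res are assumptions for illustration.

```python
import io
import pandas as pd

# Sample data assumed for illustration.
csv_text = (
    "Data1;Data2;A;B;B;C;C\n"
    "abc;def;text1;text2;text3;text4;text5\n"
    "asd;fgh;text2;text4;text3;text5;text1\n"
)
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

shared_groupname = r"(.*?)(?:\.\d+)?$"

# One group key per column: the name with any '.<digits>' suffix removed.
keys = df.columns.str.extract(shared_groupname, expand=False).to_numpy()

# Group the transposed frame by those keys, join each group's non-null
# values per original row, then transpose back.
res = (
    df.T.groupby(keys)
    .agg(lambda s: ", ".join(s.dropna().astype(str)))
    .T
)
print(res)
```

Unlike calling dropna() on the whole group, this drops missing values cell by cell, so a row with one empty duplicate still keeps its other values.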
