>我已经加载了一个看起来像这样的csv。
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
Date Price Quantity Colour
2019 45 56 Blue
2019 34 65 Red
Date Price Quantity Size
2019 12 32 Large
2019 45 11 Small
Date Price Country
2019 12 UK
2019 45 US
我想把它做成一个完整的数据帧,所以预期的输出是这样的;
Date Price Quantity Colour Size Country
2019 45 56 Blue NaN NaN
2019 34 65 Red NaN NaN
2019 12 32 NaN Large NaN
2019 45 11 NaN Small NaN
2019 12 32 NaN NaN UK
2019 45 11 NaN NaN US
第一列中的"日期"一词将始终拆分表格
如果可以使用Date
字符串来区分每个组,然后比较并添加Series.cumsum
,传递给groupby
并在自定义函数中按第一行创建列并按DataFrame.iloc
删除它:
g = df.iloc[:, 0].eq('Date').cumsum()
def f(x):
x.columns = x.iloc[0]
return x.iloc[1:].dropna(how='all', axis=1)
df = df.groupby(g).apply(f).reset_index(drop=True)
print (df)
Date Price Quantity Colour Size
0 2019 45 56 Blue NaN
1 2019 34 65 Red NaN
2 2019 12 32 NaN Large
3 2019 45 11 NaN Small
我会使用np.arange
来使用DataFrame.groupby
和pd.concat
:
new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(np.arange(len(df))//3)],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
Date Price Quantity Colour Size Country
1 2019 45 56 Blue NaN NaN
2 2019 34 65 Red NaN NaN
4 2019 12 32 NaN Large NaN
5 2019 45 11 NaN Small NaN
7 2019 12 NaN NaN NaN UK
8 2019 45 NaN NaN NaN US
如果要连接的每个帧的长度不同,那么我将使用Series.cumsum
new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(df.iloc[:,0].eq('date').cumsum())],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
0 Date Price Quantity Colour
1 2019 45 56 Blue
2 2019 34 65 Red
3 Date Price Quantity Size
4 2019 12 32 Large
5 2019 45 11 Small
6 Date Price Country NaN
7 2019 12 UK NaN
8 2019 45 US NaN