一个熊猫数据帧拆分中的多个表



>我已经加载了一个看起来像这样的csv。

Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3
Date      Price     Quantity     Colour
2019        45         56         Blue
2019        34         65         Red
Date      Price     Quantity     Size
2019        12         32         Large
2019        45         11         Small
Date      Price     Country     
2019        12         UK        
2019        45         US         

我想把它做成一个完整的数据帧,所以预期的输出是这样的;

Date      Price     Quantity     Colour  Size  Country
2019        45         56         Blue   NaN    NaN    
2019        34         65         Red    NaN    NaN    
2019        12         32        NaN    Large   NaN    
2019        45         11        NaN    Small   NaN    
2019        12         32        NaN    NaN     UK
2019        45         11        NaN    NaN     US

第一列中的"日期"一词将始终拆分表格

如果可以使用Date字符串来区分每个组,然后比较并添加Series.cumsum,传递给groupby并在自定义函数中按第一行创建列并按DataFrame.iloc删除它:

g = df.iloc[:, 0].eq('Date').cumsum()
def f(x):
x.columns = x.iloc[0]
return x.iloc[1:].dropna(how='all', axis=1)
df = df.groupby(g).apply(f).reset_index(drop=True)
print (df)
Date Price Quantity Colour   Size
0  2019    45       56   Blue    NaN
1  2019    34       65    Red    NaN
2  2019    12       32    NaN  Large
3  2019    45       11    NaN  Small

我会使用np.arange来使用DataFrame.groupbypd.concat

new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(np.arange(len(df))//3)],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
Date Price Quantity Colour   Size Country
1  2019    45       56   Blue    NaN     NaN
2  2019    34       65    Red    NaN     NaN
4  2019    12       32    NaN  Large     NaN
5  2019    45       11    NaN  Small     NaN
7  2019    12      NaN    NaN    NaN      UK
8  2019    45      NaN    NaN    NaN      US

如果要连接的每个帧的长度不同,那么我将使用Series.cumsum

new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(df.iloc[:,0].eq('date').cumsum())],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
0  Date  Price  Quantity Colour
1  2019     45        56   Blue
2  2019     34        65    Red
3  Date  Price  Quantity   Size
4  2019     12        32  Large
5  2019     45        11  Small
6  Date  Price   Country    NaN
7  2019     12        UK    NaN
8  2019     45        US    NaN

最新更新