Split a DataFrame into multiple DataFrames based on a recurring column value



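I received a data dump from a very old system. I have a definition.txt file that defines the header (column) values for hundreds of different tables. Alongside it, I have a corresponding number of expX.txt files that contain the actual data without any column/header information. The definition file has the following structure: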

definition.txt
|  Type | ColumnName | Source   |
|-------|------------|----------|
| 1     | customerID | exp1.txt |
| 2     | name       | NaN      |
| 2     | surname    | NaN      |
| 3     | NaN        | NaN      | ← I want to split here
| 1     | materialID | exp2.txt |
| 2     | weight     | NaN      |
| 2     | dim        | NaN      |
| 3     | NaN        | NaN      | ← I want to split here
| 1     | orderID    | exp3.txt |
...
Desired output:
dataframe1
|  Type | ColumnName | Source   | 
|-------|------------|----------|
| 1     | customerID | exp1.txt |
| 2     | name       | NaN      |
| 2     | surname    | NaN      |
dataframe2
|  Type | ColumnName | Source   | 
|-------|------------|----------|
| 1     | materialID | exp2.txt |
| 2     | weight     | NaN      |
| 2     | dim        | NaN      |
...and so on
Then I want to transpose the ColumnName values into a row to create a table header.
After that, I want to concatenate the actual data from the expX.txt file defined in the Source column.
Desired output for one example:
| CustomerID | name       | surname  | 
|------------|------------|----------|
| 125        | Max        | Cool     | line 1 in exp1.txt
| 126        | Peter      | Smith    | line 2 in exp1.txt
| 127        | Jon        | Doe      | line 3 in exp1.txt
...and so on                           ...

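The ultimate goal is to create a folder of .csv files, one per table. I am new to pandas and Python in general, so I am not sure whether I am using the right tools or the right approach for this problem.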

You can use groupby:

out = [d for _,d in df.groupby(df['Source'].notna().cumsum())]

Output:

[   Type  ColumnName    Source
 0     1  customerID  exp1.txt
 1     2        name       NaN
 2     2     surname       NaN
 3     3         NaN       NaN,
    Type  ColumnName    Source
 4     1  materialID  exp2.txt
 5     2      weight       NaN
 6     2         dim       NaN
 7     3         NaN       NaN,
    Type ColumnName    Source
 8     1    orderID  exp3.txt]
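To see why this grouping key works, here is a minimal sketch that rebuilds the rows from the question by hand (the DataFrame literal is a stand-in for however definition.txt is actually loaded). Rows with a non-missing Source mark the start of a table, and the running count of those marks labels every row with its block number. The last two lines are an optional extra step, not part of the one-liner above: they drop the Type 3 separator rows first so each piece matches the desired output.

```python
import numpy as np
import pandas as pd

# Stand-in for definition.txt, mirroring the rows shown in the question.
df = pd.DataFrame({
    "Type": [1, 2, 2, 3, 1, 2, 2, 3, 1],
    "ColumnName": ["customerID", "name", "surname", np.nan,
                   "materialID", "weight", "dim", np.nan, "orderID"],
    "Source": ["exp1.txt", np.nan, np.nan, np.nan,
               "exp2.txt", np.nan, np.nan, np.nan, "exp3.txt"],
})

# True only on rows that start a new table (Source is filled in).
starts = df["Source"].notna()

# Running count of starts: each row is labelled with its block number.
key = starts.cumsum()
print(key.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2, 3]

# Optional: remove the Type 3 separator rows before splitting.
rows = df[df["Type"] != 3]
pieces = [d for _, d in rows.groupby(rows["Source"].notna().cumsum())]
print(pieces[0]["ColumnName"].tolist())  # ['customerID', 'name', 'surname']
```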

In a loop:

for i, (_, d) in enumerate(df.groupby(df['Source'].notna().cumsum()), start=1):
    print(f'--- group {i} ---')
    print(d)

Output:

--- group 1 ---
   Type  ColumnName    Source
0     1  customerID  exp1.txt
1     2        name       NaN
2     2     surname       NaN
3     3         NaN       NaN
--- group 2 ---
   Type  ColumnName    Source
4     1  materialID  exp2.txt
5     2      weight       NaN
6     2         dim       NaN
7     3         NaN       NaN
--- group 3 ---
   Type ColumnName    Source
8     1    orderID  exp3.txt
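The remaining steps from the question (using ColumnName as the header, attaching the data from each expX.txt file, and writing one .csv per table) could be sketched roughly as follows. This is only an illustration under assumptions the question does not confirm: `build_tables` and its parameters are hypothetical names, the data files are assumed to be comma-separated (adjust `sep` otherwise), and each file is assumed to have exactly as many columns as its block lists names. The demo at the end uses a temporary folder and two rows taken from the example in the question.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def build_tables(definition, data_dir, out_dir, sep=","):
    """Write one CSV per table described in `definition` (hypothetical helper).

    `sep` is a guess at the delimiter used inside the expX.txt files.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Drop the Type 3 separator rows, then number the blocks with cumsum.
    rows = definition[definition["Type"] != 3]
    key = rows["Source"].notna().cumsum()

    written = []
    for _, block in rows.groupby(key):
        source = block["Source"].iloc[0]       # file named on the block's first row
        header = block["ColumnName"].tolist()  # column names, in order
        # The data files carry no header line, so supply the names here.
        table = pd.read_csv(data_dir / source, sep=sep, header=None, names=header)
        target = out_dir / Path(source).with_suffix(".csv").name
        table.to_csv(target, index=False)
        written.append(target)
    return written


# Demo with a single table in a temporary folder.
definition = pd.DataFrame({
    "Type": [1, 2, 2, 3],
    "ColumnName": ["customerID", "name", "surname", np.nan],
    "Source": ["exp1.txt", np.nan, np.nan, np.nan],
})
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "exp1.txt").write_text("125,Max,Cool\n126,Peter,Smith\n")
    paths = build_tables(definition, tmp, Path(tmp) / "tables")
    result = pd.read_csv(paths[0])
print(result)
```

Passing `header=None` together with `names=` tells `read_csv` that the file has no header row and that the supplied list should be used as column labels, which avoids a separate transpose step.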
