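I received a data dump from a very old system. I have a definition.txt file that defines the header (column) values for hundreds of different tables. In addition, I have a corresponding number of expX.txt files, which contain the actual data without any column/header information. The definition file has the following structure: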
definition.txt
| Type | ColumnName | Source |
|-------|------------|----------|
| 1 | customerID | exp1.txt |
| 2 | name | NaN |
| 2 | surname | NaN |
| 3 | NaN | NaN | ← here I want to split
| 1 | materialID | exp2.txt |
| 2 | weight | NaN |
| 2 | dim | NaN |
| 3 | NaN | NaN | ← here I want to split
| 1 | orderID | exp3.txt |
...
Desired output:
dataframe1
| Type | ColumnName | Source |
|-------|------------|----------|
| 1 | customerID | exp1.txt |
| 2 | name | NaN |
| 2 | surname | NaN |
dataframe2
| Type | ColumnName | Source |
|-------|------------|----------|
| 1 | materialID | exp2.txt |
| 2 | weight | NaN |
| 2 | dim | NaN |
...and so on
Then I want to transpose the ColumnName values into a single row to create a table header.
After that I want to concatenate the actual data from the expX.txt file defined in the Source column.
Desired output for one example:
| CustomerID | name | surname |
|------------|------------|----------|
| 125 | Max | Cool | line 1 in exp1.txt
| 126 | Peter | Smith | line 2 in exp1.txt
| 127 | Jon | Doe | line 3 in exp1.txt
...and so on ...
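The end goal is to create a folder of .csv files representing the different tables. I'm new to Pandas and Python in general, so I'm not sure whether I'm using the right toolset or mindset for this challenge.
You can use groupby, starting a new group at every row where Source is not NaN: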
out = [d for _,d in df.groupby(df['Source'].notna().cumsum())]
Output:
[ Type ColumnName Source
0 1 customerID exp1.txt
1 2 name NaN
2 2 surname NaN
3 3 NaN NaN,
Type ColumnName Source
4 1 materialID exp2.txt
5 2 weight NaN
6 2 dim NaN
7 3 NaN NaN,
Type ColumnName Source
8 1 orderID exp3.txt]
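The groups above still carry the Type 3 separator rows, which the desired output leaves out. Here is a minimal sketch of one way to drop them and reduce each group to a header list. It assumes the separator rows are exactly the rows with no ColumnName, and it rebuilds `df` by hand from the sample rows so the snippet runs on its own. The `headers` dict is a hypothetical name, not something from the question.

```python
import numpy as np
import pandas as pd

# Sample of definition.txt rebuilt by hand so the snippet is self-contained.
df = pd.DataFrame({
    "Type": [1, 2, 2, 3, 1, 2, 2, 3, 1],
    "ColumnName": ["customerID", "name", "surname", np.nan,
                   "materialID", "weight", "dim", np.nan, "orderID"],
    "Source": ["exp1.txt", np.nan, np.nan, np.nan,
               "exp2.txt", np.nan, np.nan, np.nan, "exp3.txt"],
})

# A new group starts at every row that names a source file.
group_id = df["Source"].notna().cumsum()

# Assumption: separator rows are the ones without a ColumnName.
is_column = df["ColumnName"].notna()

# Map each source file to its ordered list of column names.
headers = {
    d["Source"].iloc[0]: d["ColumnName"].tolist()
    for _, d in df[is_column].groupby(group_id[is_column])
}
print(headers)
```

Filtering on ColumnName rather than on `Type == 3` is a judgment call. If the real file uses other Type codes for rows that are not columns, the mask would need adjusting.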
In a loop:
for i, (_, d) in enumerate(df.groupby(df['Source'].notna().cumsum()), start=1):
    print(f'--- group {i} ---')
    print(d)
Output:
--- group 1 ---
Type ColumnName Source
0 1 customerID exp1.txt
1 2 name NaN
2 2 surname NaN
3 3 NaN NaN
--- group 2 ---
Type ColumnName Source
4 1 materialID exp2.txt
5 2 weight NaN
6 2 dim NaN
7 3 NaN NaN
--- group 3 ---
Type ColumnName Source
8 1 orderID exp3.txt
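To get from the groups to the folder of .csv files the question asks for, each header list has to be attached to its data file. The sketch below is one possible approach, not a definitive implementation. The question does not say how the expX.txt files are delimited, so the comma separator is an assumption (pass a different `sep` if needed). `export_tables` and the `headers` dict (source file name mapped to column names) are hypothetical names, and the demo writes a throwaway exp1.txt from the sample rows in the question.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_tables(headers, data_dir, out_dir, sep=","):
    """Attach each header list to its data file and write one CSV per table."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source, columns in headers.items():
        # header=None: the data files carry no header row of their own.
        table = pd.read_csv(data_dir / source, sep=sep, header=None, names=columns)
        target = out_dir / Path(source).with_suffix(".csv").name
        table.to_csv(target, index=False)
        written.append(target)
    return written


# Demo in a temporary directory; the file contents are the sample rows
# from the question, assumed here to be comma-separated.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "exp1.txt").write_text("125,Max,Cool\n126,Peter,Smith\n127,Jon,Doe\n")
    headers = {"exp1.txt": ["customerID", "name", "surname"]}
    paths = export_tables(headers, tmp, tmp / "csv")
    result = pd.read_csv(paths[0])
    print(result)
```

If a data file has a different number of fields than its header list, `read_csv` will not necessarily raise an error, so it is worth checking the resulting shapes on the real dump.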