我有一个3列的excel文件,我读取到熊猫基本上有k:v对列
stuff Unnamed: 1 Unnamed: 2
0 animal:dog NaN NaN
1 color:brown NaN NaN
2 age:12 NaN NaN
3 height: 60cm doctor: mike visit date: 01/12/21
4 unique animal id:3597230857 doctor id: 2 NaN
5 NaN NaN NaN
6 animal: cat NaN NaN
7 color: orage NaN NaN
8 age:21 NaN NaN
9 height: 40cm doctor: mike NaN
10 unique animal id: 95760203 doctor id: 2 visit date:05/21/20
11 NaN NaN NaN
12 animal: snake NaN NaN
13 color: orage NaN NaN
14 age:3 doctor:sally NaN
15 unique animal id: 325125 doctor id:1 visit date:05/21/20
16 NaN NaN NaN
17 NaN NaN NaN
18 animal: bird NaN NaN
19 color: blue NaN NaN
20 height: 40cm doctor:sally NaN
21 unique animal id: 496701275 doctor id:1 visit date:05/21/20
Process finished with exit code 0
我不仅需要将信息绑定在unnamed:1 &未命名的:2到唯一的动物ID因为这是我跟踪动物的方式但也调换了这些列在"; "是带有适当值的列标头。
animal color age height(cm) unique animal id doctor dictir id visit date
0 dog brown 12.0 60.0 3597230857 mike 2 2021-01-12
1 cat orage 21.0 40.0 95760203 mike 2 2020-05-21
2 snake orange 3.0 NaN 325125 sally 1 2020-05-21
3 bird blue NaN 40.0 496701275 sally 1 2020-05-21
当前正在处理
for source in sources:
df = pd.read_excel(source)
df['new_heading'] = df['stuff'].str.split(':').str[0]
df['new_value'] = df['stuff'].str.split(':').str[1]
df.loc[df['new_heading'] == "animal", 'new_aninmal'] = df['new_value']
df.loc[df['new_heading'] == "color", 'new_color'] = df['new_value']
df.loc[df['new_heading'] == "age", 'new_age'] = df['new_value']
df.loc[df['new_heading'] == "height", 'new_height'] = df['new_value']
df.loc[df['new_heading'] == "unique animal id", 'new_uid'] = df['new_value']
但它给出了这个瀑布输出(预期如此)
stuff Unnamed: 1 Unnamed: 2 new_heading new_value new_aninmal new_color new_age new_height new_uid
0 animal:dog NaN NaN animal dog dog NaN NaN NaN NaN
1 color:brown NaN NaN color brown NaN brown NaN NaN NaN
2 age:12 NaN NaN age 12 NaN NaN 12 NaN NaN
3 height: 60cm doctor: mike visit date: 01/12/21 height 60cm NaN NaN NaN 60cm NaN
4 unique animal id:3597230857 doctor id: 2 NaN unique animal id 3597230857 NaN NaN NaN NaN 3597230857
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 animal: cat NaN NaN animal cat cat NaN NaN NaN NaN
7 color: orage NaN NaN color orage NaN orage NaN NaN NaN
8 age:21 NaN NaN age 21 NaN NaN 21 NaN NaN
9 height: 40cm doctor: mike NaN height 40cm NaN NaN NaN 40cm NaN
10 unique animal id: 95760203 doctor id: 2 visit date:05/21/20 unique animal id 95760203 NaN NaN NaN NaN 95760203
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 animal: snake NaN NaN animal snake snake NaN NaN NaN NaN
13 color: orage NaN NaN color orage NaN orage NaN NaN NaN
14 age:3 doctor:sally NaN age 3 NaN NaN 3 NaN NaN
15 unique animal id: 325125 doctor id:1 visit date:05/21/20 unique animal id 325125 NaN NaN NaN NaN 325125
16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 animal: bird NaN NaN animal bird bird NaN NaN NaN NaN
19 color: blue NaN NaN color blue NaN blue NaN NaN NaN
20 height: 40cm doctor:sally NaN height 40cm NaN NaN NaN 40cm NaN
21 unique animal id: 496701275 doctor id:1 visit date:05/21/20 unique animal id 496701275 NaN NaN NaN NaN 496701275
每个部分都以animal
开头,因此创建一个计数器来确认它。计数器将用于数据的旋转:
reshape = (
df.assign(counter=np.where(df.stuff.str.startswith("animal:"), 1, 0).cumsum())
.melt("counter")
.dropna()
.drop(columns="variable")
)
在这里,我们用:
分割列并透视数据;计数器确保唯一索引,这是成功的枢轴所必需的:
(
reshape.value.str.split(":", expand=True)
.assign(counter=reshape.counter)
.pivot("counter", 0, 1)
)
age animal color doctor doctor id height unique animal id visit date
counter
1 12 dog brown mike 2 60cm 3597230857 01/12/21
3 21 cat orage mike 2 40cm 95760203 05/21/20
5 3 snake orage sally 1 NaN 325125 05/21/20
8 NaN bird blue sally 1 40cm 496701275 05/21/20
# find the animal group
df['index'] = df['stuff'].fillna('').str.contains('animal:').cumsum()
# stack to drop na and group animal group
obj = df.set_index('index').stack().droplevel(1)
# print(obj)
# split and expand, then strip blank space
df = obj.str.split(':', expand=True).applymap(str.strip)
# reset_index and unstack
dfn = df.set_index(0, append=True)[1].unstack()
# print(dfn)
输出:
print(obj)
index
1 animal:dog
1 color:brown
1 age:12
1 height: 60cm
1 doctor: mike
1 visit date: 01/12/21
1 unique animal id:3597230857
1 doctor id: 2
2 animal: cat
2 color: orage
2 age:21
2 height: 40cm
2 doctor: mike
2 unique animal id: 95760203
2 doctor id: 2
2 visit date:05/21/20
3 animal: snake
3 color: orage
3 age:3
3 doctor:sally
3 unique animal id: 325125
3 doctor id:1
3 visit date:05/21/20
4 animal: bird
4 color: blue
4 height: 40cm
4 doctor:sally
4 unique animal id: 496701275
4 doctor id:1
4 visit date:05/21/20
dtype: object
print(dfn)
age animal color doctor doctor id height unique animal id visit date
1 12 dog brown mike 2 60cm 3597230857 01/12/21
2 21 cat orage mike 2 40cm 95760203 05/21/20
3 3 snake orage sally 1 NaN 325125 05/21/20
4 NaN bird blue sally 1 40cm 496701275 05/21/20
样本数据:
import pandas as pd
import numpy as np
nan = np.nan
data = [{'stuff': 'animal:dog', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'color:brown', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'age:12', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'height: 60cm',
'Unnamed: 1': 'doctor: mike',
'Unnamed: 2': 'visit date: 01/12/21'},
{'stuff': 'unique animal id:3597230857',
'Unnamed: 1': 'doctor id: 2',
'Unnamed: 2': nan},
{'stuff': nan, 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'animal: cat', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'color: orage', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'age:21', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'height: 40cm', 'Unnamed: 1': 'doctor: mike', 'Unnamed: 2': nan},
{'stuff': 'unique animal id: 95760203',
'Unnamed: 1': 'doctor id: 2',
'Unnamed: 2': 'visit date:05/21/20'},
{'stuff': nan, 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'animal: snake', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'color: orage', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'age:3', 'Unnamed: 1': 'doctor:sally', 'Unnamed: 2': nan},
{'stuff': 'unique animal id: 325125',
'Unnamed: 1': 'doctor id:1',
'Unnamed: 2': 'visit date:05/21/20'},
{'stuff': nan, 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': nan, 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'animal: bird', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'color: blue', 'Unnamed: 1': nan, 'Unnamed: 2': nan},
{'stuff': 'height: 40cm', 'Unnamed: 1': 'doctor:sally', 'Unnamed: 2': nan},
{'stuff': 'unique animal id: 496701275',
'Unnamed: 1': 'doctor id:1',
'Unnamed: 2': 'visit date:05/21/20'}]
df = pd.DataFrame(data)