假设我有一个这样的df,
uid | 202012AP | 202012CAN | 202012CASA | 202012total | 202101AP | 202101CAN | 202101CASA | 202101total |...
A1 | 0 | 1 | 10 | 11 | 1 | 2 | 0 | 3 |...
A2 | 1 | 8 | 99 | 108 | 4 | 5 | 1 | 10 |...
B1 | 0 | 1 | 99 | 100 | 2 | 3 | 2 | 7 |...
我希望能够生成如下输出df,
uid | no | reg | amount
A1 | 202012 | AP | 0
A1 | 202012 | CAN | 1
A1 | 202012 | CASA | 10
A1 | 202101 | AP | 1
A1 | 202101 | CAN | 2
A1 | 202101 | CASA | 0
A2 | 202012 | AP | 1
A2 | 202012 | CAN | 8
A2 | 202012 | CASA | 99
A2 | 202101 | AP | 4
A2 | 202101 | CAN | 5
A2 | 202101 | CASA | 1
等等。
有熊猫图书馆来完成这项任务吗?如有任何帮助,我们将不胜感激。
附言:我实际上不需要01total或02totalols来在输出df中使用,但我提供了它,因为这是df拥有的。
您可以使用melt、concat和sort_value的组合:
# create long df using melt, cols is just an intermediary column
res = df.melt(id_vars='uid', value_name='amount', var_name='cols')
# remove total rows
res = res[~res['cols'].str.endswith('total')]
# concat the two new columns no and reg with the previously melted df
res = pd.concat((res.drop('cols', 1), res['cols'].str.extract(r'(?P<no>d+)(?P<reg>w+)')), axis=1)
# sort both columns and rows to match the expected output
res = res.sort_values(['uid', 'no', 'reg'])[['uid', 'no', 'reg', 'amount']]
print(res)
输出
uid no reg amount
0 A1 01 reg1 0
3 A1 01 reg2 1
6 A1 01 reg3 10
12 A1 02 reg1 1
15 A1 02 reg2 2
18 A1 02 reg3 0
1 A2 01 reg1 1
4 A2 01 reg2 8
7 A2 01 reg3 99
13 A2 02 reg1 4
16 A2 02 reg2 5
19 A2 02 reg3 1
2 B1 01 reg1 0
5 B1 01 reg2 1
8 B1 01 reg3 99
14 B1 02 reg1 2
17 B1 02 reg2 3
20 B1 02 reg3 2
请注意,我们还使用str.extract作为中间步骤:
res['cols'].str.extract(r'(?P<no>d+)(?P<reg>w+)'))
这是使用带有命名组的regex直接创建具有名称的列。
想法是先通过DataFrame.set_index
将iud
转换为index
,然后通过Index.str.extract
创建MultiIndex in columns
,再通过DataFrame.stack
重塑:
df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
df1.columns = df1.columns.str.extract('(d+)(w+)').set_index([0,1]).index
df1 = df1.rename_axis(['no','reg'], axis=1).stack([0,1]).reset_index(name='amount')
print (df1)
uid no reg amount
0 A1 01 reg1 0
1 A1 01 reg2 1
2 A1 01 reg3 10
3 A1 02 reg1 1
4 A1 02 reg2 2
5 A1 02 reg3 0
6 A2 01 reg1 1
7 A2 01 reg2 8
8 A2 01 reg3 99
9 A2 02 reg1 4
10 A2 02 reg2 5
11 A2 02 reg3 1
12 B1 01 reg1 0
13 B1 01 reg2 1
14 B1 01 reg3 99
15 B1 02 reg1 2
16 B1 02 reg2 3
17 B1 02 reg3 2
编辑:
df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
df1.columns = df1.columns.str.extract('(d+)(w+)').set_index([0,1]).index
df1 = df1.rename_axis(['no','reg'], axis=1).stack([0,1]).reset_index(name='amount')
print (df1)
uid no reg amount
0 A1 202012 AP 0
1 A1 202012 CAN 1
2 A1 202012 CASA 10
3 A1 202101 AP 1
4 A1 202101 CAN 2
5 A1 202101 CASA 0
6 A2 202012 AP 1
7 A2 202012 CAN 8
8 A2 202012 CASA 99
9 A2 202101 AP 4
10 A2 202101 CAN 5
11 A2 202101 CASA 1
12 B1 202012 AP 0
13 B1 202012 CAN 1
14 B1 202012 CASA 99
15 B1 202101 AP 2
16 B1 202101 CAN 3
17 B1 202101 CASA 2
只保留相关列:
box = df.filter(regex="uid|0[1-2]reg.")
通过使用正则表达式进行拆分来创建数据帧
cols = (
pd.Series(box.columns)
.str.split("(reg.)", expand=True)
.iloc[:, :2]
.fillna("") # this allows easy melting on "uid"
.set_axis(["no", "reg"], axis="columns")
)
将cols
作为多索引分配给box.columns
:
box.columns = pd.MultiIndex.from_frame(cols)
box
no uid 01 02
reg reg1 reg2 reg3 reg1 reg2 reg3
0 A1 0 1 10 1 2 0
1 A2 1 8 99 4 5 1
2 B1 0 1 99 2 3 2
现在你可以融化了:
box.melt("uid")
uid no reg value
0 A1 01 reg1 0
1 A2 01 reg1 1
2 B1 01 reg1 0
3 A1 01 reg2 1
4 A2 01 reg2 8
5 B1 01 reg2 1
6 A1 01 reg3 10
7 A2 01 reg3 99
8 B1 01 reg3 99
9 A1 02 reg1 1
10 A2 02 reg1 4
11 B1 02 reg1 2
12 A1 02 reg2 2
13 A2 02 reg2 5
14 B1 02 reg2 3
15 A1 02 reg3 0
16 A2 02 reg3 1
17 B1 02 reg3 2
更新(2021年1月6日(:pyjanitor的pivot_langer函数可以实现更简单的(意见挖掘(抽象;目前,您必须安装github的最新开发版本:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.filter(regex="^(?!.*total)").pivot_longer(
index="uid",
names_pattern="(d+)(.+)",
names_to=("no", "reg"),
values_to="amount",
sort_by_appearance=True,
)
uid no reg amount
0 A1 202012 AP 0
1 A1 202012 CAN 1
2 A1 202012 CASA 10
3 A1 202101 AP 1
4 A1 202101 CAN 2
5 A1 202101 CASA 0
6 A2 202012 AP 1
7 A2 202012 CAN 8
8 A2 202012 CASA 99
9 A2 202101 AP 4
10 A2 202101 CAN 5
11 A2 202101 CASA 1
12 B1 202012 AP 0
13 B1 202012 CAN 1
14 B1 202012 CASA 99
15 B1 202101 AP 2
16 B1 202101 CAN 3
17 B1 202101 CASA 2