如何拆分标题行以创建新列



假设我有一个这样的df,

uid | 202012AP   | 202012CAN  | 202012CASA | 202012total  | 202101AP   | 202101CAN  | 202101CASA | 202101total  |...
A1  |   0        |   1        |   10       |    11        |     1      |   2        |   0        |   3          |...
A2  |   1        |   8        |   99       |    108       |     4      |   5        |   1        |   10         |...
B1  |   0        |   1        |   99       |    100       |     2      |   3        |   2        |   7          |...

我希望能够生成如下输出df,

uid | no     | reg  | amount
A1  | 202012 | AP   |   0
A1  | 202012 | CAN  |   1
A1  | 202012 | CASA |   10
A1  | 202101 | AP   |   1
A1  | 202101 | CAN  |   2
A1  | 202101 | CASA |   0
A2  | 202012 | AP   |   1
A2  | 202012 | CAN  |   8
A2  | 202012 | CASA |   99
A2  | 202101 | AP   |   4
A2  | 202101 | CAN  |   5
A2  | 202101 | CASA |   1

等等。

有熊猫图书馆来完成这项任务吗?如有任何帮助,我们将不胜感激。

附言:我实际上不需要01total02totalols来在输出df中使用,但我提供了它,因为这是df拥有的。

您可以使用melt、concat和sort_value的组合:

# create long df using melt, cols is just an intermediary column
res = df.melt(id_vars='uid', value_name='amount', var_name='cols')
# remove total rows
res = res[~res['cols'].str.endswith('total')]
# concat the two new columns no and reg with the previously melted df
res = pd.concat((res.drop('cols', 1), res['cols'].str.extract(r'(?P<no>d+)(?P<reg>w+)')), axis=1)
# sort both columns and rows to match the expected output
res = res.sort_values(['uid', 'no', 'reg'])[['uid', 'no', 'reg', 'amount']]
print(res)

输出

uid  no   reg  amount
0   A1  01  reg1       0
3   A1  01  reg2       1
6   A1  01  reg3      10
12  A1  02  reg1       1
15  A1  02  reg2       2
18  A1  02  reg3       0
1   A2  01  reg1       1
4   A2  01  reg2       8
7   A2  01  reg3      99
13  A2  02  reg1       4
16  A2  02  reg2       5
19  A2  02  reg3       1
2   B1  01  reg1       0
5   B1  01  reg2       1
8   B1  01  reg3      99
14  B1  02  reg1       2
17  B1  02  reg2       3
20  B1  02  reg3       2

请注意,我们还使用str.extract作为中间步骤:

res['cols'].str.extract(r'(?P<no>d+)(?P<reg>w+)'))

这是使用带有命名组的regex直接创建具有名称的列。

想法是先通过DataFrame.set_indexiud转换为index,然后通过Index.str.extract创建MultiIndex in columns,再通过DataFrame.stack重塑:

df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
df1.columns = df1.columns.str.extract('(d+)(w+)').set_index([0,1]).index
df1 = df1.rename_axis(['no','reg'], axis=1).stack([0,1]).reset_index(name='amount')
print (df1)
uid  no   reg  amount
0   A1  01  reg1       0
1   A1  01  reg2       1
2   A1  01  reg3      10
3   A1  02  reg1       1
4   A1  02  reg2       2
5   A1  02  reg3       0
6   A2  01  reg1       1
7   A2  01  reg2       8
8   A2  01  reg3      99
9   A2  02  reg1       4
10  A2  02  reg2       5
11  A2  02  reg3       1
12  B1  01  reg1       0
13  B1  01  reg2       1
14  B1  01  reg3      99
15  B1  02  reg1       2
16  B1  02  reg2       3
17  B1  02  reg3       2

编辑:

df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
df1.columns = df1.columns.str.extract('(d+)(w+)').set_index([0,1]).index
df1 = df1.rename_axis(['no','reg'], axis=1).stack([0,1]).reset_index(name='amount')
print (df1)
uid      no   reg  amount
0   A1  202012    AP       0
1   A1  202012   CAN       1
2   A1  202012  CASA      10
3   A1  202101    AP       1
4   A1  202101   CAN       2
5   A1  202101  CASA       0
6   A2  202012    AP       1
7   A2  202012   CAN       8
8   A2  202012  CASA      99
9   A2  202101    AP       4
10  A2  202101   CAN       5
11  A2  202101  CASA       1
12  B1  202012    AP       0
13  B1  202012   CAN       1
14  B1  202012  CASA      99
15  B1  202101    AP       2
16  B1  202101   CAN       3
17  B1  202101  CASA       2

只保留相关列:

box = df.filter(regex="uid|0[1-2]reg.")

通过使用正则表达式进行拆分来创建数据帧

cols = (
pd.Series(box.columns)
.str.split("(reg.)", expand=True)
.iloc[:, :2]
.fillna("") # this allows easy melting on "uid"
.set_axis(["no", "reg"], axis="columns")
)

cols作为多索引分配给box.columns:

box.columns = pd.MultiIndex.from_frame(cols)
box

no  uid                 01                       02
reg     reg1    reg2    reg3    reg1    reg2    reg3
0   A1     0    1         10    1          2    0
1   A2     1    8         99    4          5    1
2   B1     0    1         99    2          3    2

现在你可以融化了:

box.melt("uid")
uid no  reg value
0   A1  01  reg1    0
1   A2  01  reg1    1
2   B1  01  reg1    0
3   A1  01  reg2    1
4   A2  01  reg2    8
5   B1  01  reg2    1
6   A1  01  reg3    10
7   A2  01  reg3    99
8   B1  01  reg3    99
9   A1  02  reg1    1
10  A2  02  reg1    4
11  B1  02  reg1    2
12  A1  02  reg2    2
13  A2  02  reg2    5
14  B1  02  reg2    3
15  A1  02  reg3    0
16  A2  02  reg3    1
17  B1  02  reg3    2

更新(2021年1月6日(:pyjanitor的pivot_langer函数可以实现更简单的(意见挖掘(抽象;目前,您必须安装github的最新开发版本:

# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.filter(regex="^(?!.*total)").pivot_longer(
index="uid",
names_pattern="(d+)(.+)",
names_to=("no", "reg"),
values_to="amount",
sort_by_appearance=True,
)
uid no     reg  amount
0   A1  202012  AP      0
1   A1  202012  CAN     1
2   A1  202012  CASA    10
3   A1  202101  AP      1
4   A1  202101  CAN     2
5   A1  202101  CASA    0
6   A2  202012  AP      1
7   A2  202012  CAN     8
8   A2  202012  CASA    99
9   A2  202101  AP      4
10  A2  202101  CAN     5
11  A2  202101  CASA    1
12  B1  202012  AP      0
13  B1  202012  CAN     1
14  B1  202012  CASA    99
15  B1  202101  AP      2
16  B1  202101  CAN     3
17  B1  202101  CASA    2

最新更新