在执行逐行反聚合(使用Pandas)时,棘手的字段名到值的转换



我有一个数据集,我想将特定字段名转换为值,同时执行将值分解为自己唯一的行以及执行长枢轴。

Start       Date        End         Area    Final       Type    Middle Stat Low Stat    High Stat  Middle Stat1 Low Stat1    High Stat1
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      226         20          10         0             0            0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      130         50          0          0             0            0



data = {
"Start": ['8/1/2013', '8/1/2013'],
"Date": ['9/1/2013', '9/1/2013'],
"End": ['10/1/2013', '10/1/2013'],
"Area": ['NY', 'CA'],
"Final": ['3/1/2023', '3/1/2023'],
"Type": ['CC', 'AA'],
"Middle Stat": [226, 130],
"Low Stat": [20, 50],
"High Stat": [10, 0],
"Middle Stat1": [0, 0],
"Low Stat1": [0, 0],
"High Stat1": [0, 0]
}

Start       Date        End         Area    Final       Type    Stat    Range   Stat1
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      20      Low     0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      50      Low     0
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      226     Middle  0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      130     Middle  0
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      10      High    0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      0       High    0

我相信我必须注入某种宽到长方法,(SO成员协助)但不确定如何将此合并,同时在目标(感兴趣的列)列名中具有相同的后缀。

pd.wide_to_long(df, 
stubnames=['Low','Middle','High'],
i=['Start','Date','End','Area','Final'],
j='',
sep=' ',
suffix='(stat)'
).unstack(level=-1, fill_value=0).stack(level=0).reset_index()

欢迎提出任何建议。

#原始数据集

import pandas as pd
# create DataFrame
data = {'Start': ['9/1/2013', '10/1/2013', '11/1/2013', '12/1/2013'],
'Date': ['10/1/2016', '11/1/2016', '12/1/2016', '1/1/2017'],
'End': ['11/1/2016', '12/1/2016', '1/1/2017', '2/1/2017'],
'Area': ['NY', 'NY', 'NY', 'NY'],
'Final': ['3/1/2023', '3/1/2023', '3/1/2023', '3/1/2023'],
'Type': ['CC', 'CC', 'CC', 'CC'],
'Low Stat': ['', '', '', ''],
'Low Stat1': ['', '', '', ''],
'Middle Stat': ['0', '0', '0', '0'],
'Middle Stat1': ['0', '0', '0', '0'],
'Re': ['','','',''],
'Set': ['0', '0', '0', '0'],
'Set2': ['0', '0', '0', '0'],
'Set3': ['0', '0', '0', '0'],
'High Stat': ['', '', '', ''],
'High Stat1': ['', '', '', '']}
df = pd.DataFrame(data)
df.melt(id_vars=df.columns[:6], value_name='Values')
Start      Date        End Area     Final Type     variable  Values
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC  Middle Stat    226
1  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA  Middle Stat    130
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC    Low Stat      20
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA    Low Stat      50
4  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC   High Stat      10
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA   High Stat       0

一个选项是使用pyjanitor的pivot_longer -在这种情况下,我们使用特殊占位符.value来标识我们希望保留为标题的列的部分,而其余部分则被整理成一个新列:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = slice('Start', 'Type'), 
names_to = ("Range", ".value"), 
names_sep = " ")
)
Start      Date        End Area     Final Type   Range  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC  Middle   226      0
1  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA  Middle   130      0
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC     Low    20      0
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA     Low    50      0
4  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC    High    10      0
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA    High     0      0

您可以尝试先重命名列:

import re
df = df.rename(columns=lambda x: re.sub(r'(Low|Middle|High) Stat', r'Stat1', x))
x = pd.wide_to_long(df,
stubnames='Stat',
i=['Start','Date','End','Area','Final'],
j='Range', suffix=r'(?:Low|Middle|High)').reset_index()
print(x)

打印:

Start      Date        End Area     Final   Range Type  Stat
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023  Middle   CC   226
1  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023     Low   CC    20
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023    High   CC    10
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023  Middle   AA   130
4  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023     Low   AA    50
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023    High   AA     0

编辑:有更多的存根名称:

import re
df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat$", r"Stat1", x))
df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat1$", r"1Stat1", x))
x = (
pd.wide_to_long(
df,
stubnames=["Stat", "1Stat"],
i=["Start", "Date", "End", "Area", "Final"],
j="Range",
suffix=r"(?:Low|Middle|High)",
)
.reset_index()
.rename(columns={"1Stat": "Stat1"})
)

打印:

Start      Date        End Area     Final   Range Type  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023  Middle   CC   226      0
1  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023     Low   CC    20      0
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023    High   CC    10      0
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023  Middle   AA   130      0
4  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023     Low   AA    50      0
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023    High   AA     0      0

最新更新