How to sort a dataset based on 2 custom lists?



I want to sort a DataFrame according to two custom dictionaries:

custom_dict = {'HC': 0, 'AMG HC': 1, 'S': 2, 'AMG S': 3,
               'HCA':4, 'AMG HCA':5, 'MUP':6, 'AMG MUP':7}
custom_dict2 = {'Offline': 0, 'Online': 1, 'Independent':2}
df4 = df4.sort_values(by=['category','segment'], key=lambda x: x.map(custom_dict, custom_dict2))

I then run this code:

df4 = (df4.set_index(['category', 'segment'], append=True)
          .unstack()
          .swaplevel(axis=1)
          .sort_index(level=0, axis=1, ascending=False)
          .reindex([ytd1, ytd2, 'Evolution'], level=1, axis=1)
          .reset_index('category')
       #   .sort_index(level=-1, axis=0)
       )

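But it does not produce the expected output. I would like the following order: HC for Offline, AMG HC for Offline, HC for Online, AMG HC for Online, HC for Independent, AMG HC for Independent, S for Online, AMF for Online, and so on.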

Can you tell me how to do this?

Sample dataset:

df4 = {'category': {0: 'HC', 1: 'AMG HC', 2: 'S', 3: 'AMG MUP', 4: 'MUP', 5: 'S', 6: 'AMG S', 7: 'HCA'}, 'segment': {0: 'Offline', 1: 'Offline', 2: 'Offline', 3: 'Offline', 4: 'Online', 5: 'Online', 6: 'Offline', 7: 'Independent'}, 'ytd2020': {0: '101142', 1: '38541', 2: '55653', 3: '19561', 4: '84921', 5: '99301', 6: '80212', 7: '95731'}, 'ytd2021': {0: '105726', 1: '39463', 2: '57537', 3: '21402', 4: '90310', 5: '97283', 6: '87011', 7: '119289'}, 'Evolution': {0: '4.5%', 1: '2.4%', 2: '2.1%', 3: '4.3%', 4: '8.2%', 5: '-2.0%', 6: '12.4%', 7: '24.6%'}}

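You can convert the columns to ordered categoricals. This has a couple of advantages once the categoricals are in place: lower memory consumption and faster sorting: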

df4['category'] = pd.Categorical(df4['category'],
                                 categories=list(custom_dict),
                                 ordered=True)
df4['segment'] = pd.Categorical(df4['segment'],
                                categories=list(custom_dict2),
                                ordered=True)
df4 = df4.sort_values(by=['category','segment'])

NB. This solution does not need dictionaries; a list of the categories in the desired order is enough.

Example output (random input):

   category      segment
11       HC      Offline
14       HC      Offline
1        HC       Online
5        HC  Independent
16       HC  Independent
19   AMG HC      Offline
15   AMG HC       Online
3         S      Offline
4         S      Offline
0         S  Independent
12        S  Independent
9     AMG S       Online
10    AMG S  Independent
2       HCA      Offline
6       HCA      Offline
17      HCA  Independent
7   AMG HCA      Offline
13  AMG HCA  Independent
8       MUP       Online
18  AMG MUP  Independent
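
The note that plain lists are sufficient can be illustrated with a minimal, self-contained sketch. The four rows and the names category_order, segment_order and sorted_df are hypothetical, chosen only for illustration; the category and segment labels come from the question.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not the asker's data).
df = pd.DataFrame({
    "category": ["S", "AMG HC", "HC", "HC"],
    "segment": ["Online", "Offline", "Independent", "Offline"],
})

# Plain lists in the desired order; no rank dictionaries are needed.
category_order = ["HC", "AMG HC", "S", "AMG S", "HCA", "AMG HCA", "MUP", "AMG MUP"]
segment_order = ["Offline", "Online", "Independent"]

# Ordered categoricals sort by their position in the list, not alphabetically.
df["category"] = pd.Categorical(df["category"], categories=category_order, ordered=True)
df["segment"] = pd.Categorical(df["segment"], categories=segment_order, ordered=True)

sorted_df = df.sort_values(by=["category", "segment"])
print(sorted_df)
```

One thing to keep in mind with this approach: any value that does not appear in the list of categories becomes NaN in the categorical column, so the lists should cover every label that occurs in the data.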

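The key function in sort_values is applied to the category and segment columns separately, one column at a time, but you are trying to map both columns at once, which produces incorrect output. To fix the code, we can create an additional order dictionary that maps each column name to its corresponding mapping dictionary: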

order = {'category': custom_dict, 'segment': custom_dict2}
df4.sort_values(['category', 'segment'], key=lambda s: s.map(order[s.name]))

  category      segment ytd2020 ytd2021 Evolution
0       HC      Offline  101142  105726      4.5%
1   AMG HC      Offline   38541   39463      2.4%
2        S      Offline   55653   57537      2.1%
5        S       Online   99301   97283     -2.0%
6    AMG S      Offline   80212   87011     12.4%
7      HCA  Independent   95731  119289     24.6%
4      MUP       Online   84921   90310      8.2%
3  AMG MUP      Offline   19561   21402      4.3%
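
To make the per-column behaviour of key visible, here is a small self-contained sketch. The four rows, the helper rank and the list seen are hypothetical names used only for this illustration; the two dictionaries are the ones from the question. Note also that the second positional parameter of Series.map is na_action, not a second mapping, which is why a single call cannot take both dictionaries.

```python
import pandas as pd

custom_dict = {'HC': 0, 'AMG HC': 1, 'S': 2, 'AMG S': 3,
               'HCA': 4, 'AMG HCA': 5, 'MUP': 6, 'AMG MUP': 7}
custom_dict2 = {'Offline': 0, 'Online': 1, 'Independent': 2}
order = {'category': custom_dict, 'segment': custom_dict2}

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    'category': ['S', 'HC', 'AMG HC', 'HC'],
    'segment': ['Online', 'Online', 'Offline', 'Offline'],
})

seen = []  # records the name of every column passed to the key function

def rank(s):
    # s is a single column (a Series); its name selects the right dictionary.
    seen.append(s.name)
    return s.map(order[s.name])

result = df.sort_values(['category', 'segment'], key=rank)
print(seen)    # the column names the key function received
print(result)
```

Because the key function only ever sees one Series at a time, looking up the dictionary through s.name is what lets a single function serve both columns.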

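Based on my understanding of your previous question, your main goal is to place every AMG XXX immediately after the corresponding XXX category in the same segment. For example: AMG HC immediately after HC in the Offline segment, and AMG S immediately after S in the same Offline segment. To do that, you can use:
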
idx = (df4[['category','segment']].apply(tuple, axis=1)
          .sort_values(key=lambda x: x.str[0].map(custom_dict) * 10 + x.str[1].map(custom_dict2))
          .index
      )
df5 = df4.loc[idx]

Note that I have stored the sorted DataFrame under a new name, df5, rather than overwriting df4. If you prefer, you can change df5 to df4.

Result:

print(df5)

  category      segment ytd2020 ytd2021 Evolution
0       HC      Offline  101142  105726      4.5%
1   AMG HC      Offline   38541   39463      2.4%
2        S      Offline   55653   57537      2.1%
5        S       Online   99301   97283     -2.0%
6    AMG S      Offline   80212   87011     12.4%
7      HCA  Independent   95731  119289     24.6%
4      MUP       Online   84921   90310      8.2%
3  AMG MUP      Offline   19561   21402      4.3%
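
The combined key category rank * 10 + segment rank orders by category first and by segment second. If a category occurs in several segments and the goal is the order listed in the question (HC for Offline, AMG HC for Offline, HC for Online, AMG HC for Online, ...), one possible sketch is to sort on three keys instead. This is only a sketch under two assumptions: custom_dict gives every plain category an even rank with its AMG variant at the next odd rank (true for the dictionary in the question), and the index is unique. The rows and the names rank, keys and df_sorted are hypothetical.

```python
import pandas as pd

custom_dict = {'HC': 0, 'AMG HC': 1, 'S': 2, 'AMG S': 3,
               'HCA': 4, 'AMG HCA': 5, 'MUP': 6, 'AMG MUP': 7}
custom_dict2 = {'Offline': 0, 'Online': 1, 'Independent': 2}

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    'category': ['AMG HC', 'HC', 'AMG HC', 'HC', 'S'],
    'segment': ['Online', 'Online', 'Offline', 'Offline', 'Offline'],
})

rank = df['category'].map(custom_dict)
keys = pd.DataFrame({
    'group': rank // 2,                           # HC and AMG HC share group 0, S and AMG S group 1, ...
    'segment': df['segment'].map(custom_dict2),   # Offline < Online < Independent
    'amg': rank % 2,                              # 0 for the plain category, 1 for its AMG variant
})

# Sort the helper frame, then reuse its index order on the original data.
df_sorted = df.loc[keys.sort_values(['group', 'segment', 'amg']).index]
print(df_sorted)
```

Keeping the sort keys in a separate helper frame leaves the original columns untouched, at the cost of relying on a unique index for the final df.loc lookup.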
