Pandas: is there a native way to sort rows by providing a list of index labels?



Let's take this dataframe:

import pandas as pd
L0 = ['d','a','b','c','d','a','b','c','d','a','b','c']
L1 = ['z','z','z','z','x','x','x','x','y','y','y','y']
L2 = [1,6,3,8,7,6,7,6,3,5,6,5]
df = pd.DataFrame({"A":L0,"B":L1,"C":L2})
df = df.pivot(columns="A",index="B",values="C")

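After pivoting, both the columns and the rows are in alphabetical order.

Reordering the columns is easy and can be done with a custom list of column labels: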

df = df[['d','a','b','c']]

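But there is no equally direct feature for reordering rows. The most elegant way I can think of is to use the column-label selection and transpose back and forth: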

df = df.T[['z','x','y']].T

This, for example, has no effect at all:

df.loc[['x','y','z'],:] = df.loc[['z','x','y'],:]
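
A likely explanation, added here as a hedged sketch rather than something stated in the post: when the right-hand side of a loc assignment is itself a DataFrame, pandas aligns it on its labels before writing, so every row is written back onto itself. The snapshot name `before` is introduced only for this illustration.

```python
import pandas as pd

L0 = ['d','a','b','c','d','a','b','c','d','a','b','c']
L1 = ['z','z','z','z','x','x','x','x','y','y','y','y']
L2 = [1,6,3,8,7,6,7,6,3,5,6,5]
df = pd.DataFrame({"A": L0, "B": L1, "C": L2})
df = df.pivot(columns="A", index="B", values="C")

# Keep a snapshot so the effect of the assignment can be checked.
before = df.copy()

# The right-hand side is a DataFrame, so it is aligned on row labels:
# row 'x' receives the values of row 'x', and so on.
df.loc[['x', 'y', 'z'], :] = df.loc[['z', 'x', 'y'], :]

# Neither the values nor the order of the index labels has changed.
unchanged = bool((df.values == before.values).all())
print(unchanged, list(df.index))
```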

Is there really no direct way to order the rows of a dataframe by providing a custom list of index labels?

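You can use reindex or reindex_axis, both of which are faster than loc. (reindex_axis was deprecated in pandas 0.21 and removed in pandas 1.0, so on current versions use reindex.)

For the index: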

idx = ['z','x','y']
df = df.reindex(idx)
print (df)
A  a  b  c  d
B            
z  6  3  8  1
x  6  7  6  7
y  5  6  5  3

Or:

idx = ['z','x','y']
df = df.reindex_axis(idx)
print (df)
A  a  b  c  d
B            
z  6  3  8  1
x  6  7  6  7
y  5  6  5  3

As SSM pointed out, loc also works:

df = df.loc[['z', 'x', 'y'], :]
print (df)
A  a  b  c  d
B            
z  6  3  8  1
x  6  7  6  7
y  5  6  5  3

For the columns:

cols = ['d','a','b','c']
df = df.reindex(columns=cols)
print (df)
A  d  a  b  c
B            
x  7  6  7  6
y  3  5  6  5
z  1  6  3  8

Or:

cols = ['d','a','b','c']
df = df.reindex_axis(cols, axis=1)
print (df)
A  d  a  b  c
B            
x  7  6  7  6
y  3  5  6  5
z  1  6  3  8

Both at once:

idx = ['z','x','y']
cols = ['d','a','b','c']
df = df.reindex(columns=cols, index=idx)
print (df)
A  d  a  b  c
B            
z  1  6  3  8
x  7  6  7  6
y  3  5  6  5
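
The variants above can be sketched for a recent pandas release as follows, assuming pandas 1.0 or later, where reindex_axis is no longer available. The sketch also illustrates one behavioral difference worth knowing: reindex fills unknown labels with NaN, while loc raises KeyError for them. The label 'q' is a made-up missing label used only for illustration.

```python
import pandas as pd

L0 = ['d','a','b','c','d','a','b','c','d','a','b','c']
L1 = ['z','z','z','z','x','x','x','x','y','y','y','y']
L2 = [1,6,3,8,7,6,7,6,3,5,6,5]
df = pd.DataFrame({"A": L0, "B": L1, "C": L2})
df = df.pivot(columns="A", index="B", values="C")

idx = ['z', 'x', 'y']
cols = ['d', 'a', 'b', 'c']

# Reorder both axes by label, once with reindex and once with loc.
by_reindex = df.reindex(index=idx, columns=cols)
by_loc = df.loc[idx, cols]

# 'q' is not in the index: reindex adds an all-NaN row for it...
with_missing = df.reindex(index=['z', 'q'])

# ...whereas loc refuses the unknown label (pandas 1.0 or later).
try:
    df.loc[['z', 'q']]
    loc_raised = False
except KeyError:
    loc_raised = True

print(by_reindex)
print(with_missing)
print("loc raised KeyError:", loc_raised)
```

If silently introduced NaN rows would be a problem, loc is the stricter choice; if missing labels are expected, reindex is more forgiving.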

Timings:

In [43]: %timeit (df.loc[['z', 'x', 'y'], ['d', 'a', 'b', 'c']])
1000 loops, best of 3: 653 µs per loop
In [44]: %timeit (df.reindex(columns=cols, index=idx))
1000 loops, best of 3: 402 µs per loop

Index only:

In [49]: %timeit (df.reindex(idx))
The slowest run took 5.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 271 µs per loop
In [50]: %timeit (df.reindex_axis(idx))
The slowest run took 6.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 252 µs per loop

In [51]: %timeit (df.loc[['z', 'x', 'y']])
The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 418 µs per loop
In [52]: %timeit (df.loc[['z', 'x', 'y'], :])
The slowest run took 4.87 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 542 µs per loop

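def pir(df):
    # Assumes df.index is sorted and contains every label in idx:
    # searchsorted returns insertion points, not validated matches.
    idx = ['z','x','y']
    a = df.index.values.searchsorted(idx)
    # Rebuild the frame from the reordered values and labels.
    df = pd.DataFrame(
        df.values[a],
        df.index[a], df.columns
    )
    return df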
In [63]: %timeit (pir(df))
The slowest run took 7.75 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 91.8 µs per loop
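
For anyone who wants to repeat the comparison outside IPython, here is a minimal sketch using the standard timeit module. The loop counts and the `candidates` and `timings` names are arbitrary choices for this illustration, and the absolute numbers will differ by machine and pandas version, so the figures quoted above should not be expected to reproduce exactly.

```python
import timeit
import pandas as pd

L0 = ['d','a','b','c','d','a','b','c','d','a','b','c']
L1 = ['z','z','z','z','x','x','x','x','y','y','y','y']
L2 = [1,6,3,8,7,6,7,6,3,5,6,5]
df = pd.DataFrame({"A": L0, "B": L1, "C": L2})
df = df.pivot(columns="A", index="B", values="C")

idx = ['z', 'x', 'y']
cols = ['d', 'a', 'b', 'c']

# The two approaches being compared, wrapped as zero-argument callables.
candidates = {
    "loc": lambda: df.loc[idx, cols],
    "reindex": lambda: df.reindex(index=idx, columns=cols),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 200 calls each, reported per call in microseconds.
    best = min(timeit.repeat(fn, number=200, repeat=3))
    timings[name] = best / 200 * 1e6
    print(f"{name}: {timings[name]:.1f} µs per call")
```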

Using loc is a very natural way to do this:

df.loc[['z', 'x', 'y']]
A  d  a  b  c
B            
z  1  6  3  8
x  7  6  7  6
y  3  5  6  5

You can assign the result back to the dataframe with:

df = df.loc[['z', 'x', 'y']]

Both axes in one go with loc:

df.loc[['z', 'x', 'y'], ['d', 'a', 'b', 'c']]
A  d  a  b  c
B            
z  1  6  3  8
x  7  6  7  6
y  3  5  6  5

A fast way to do this with numpy.searchsorted (it relies on the index being sorted and on every requested label being present):

l = list('zxy')
a = df.index.values.searchsorted(l)
pd.DataFrame(
    df.values[a],
    df.index[a], df.columns
)
A  d  a  b  c
B            
z  1  6  3  8
x  7  6  7  6
y  3  5  6  5
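
Because searchsorted only gives correct positions on a sorted index, a variant based on Index.get_indexer may be safer when that cannot be guaranteed. This is a sketch of an alternative, not the method from the answer above; the helper name `reorder_rows` is hypothetical, and get_indexer itself requires the index labels to be unique.

```python
import pandas as pd

# Same frame as printed above, built directly for a self-contained example.
df = pd.DataFrame(
    [[7, 6, 7, 6], [3, 5, 6, 5], [1, 6, 3, 8]],
    index=pd.Index(['x', 'y', 'z'], name='B'),
    columns=pd.Index(['d', 'a', 'b', 'c'], name='A'),
)

def reorder_rows(frame, labels):
    """Return the rows of frame in the order given by labels."""
    # get_indexer maps each label to its integer position, or -1 if absent.
    pos = frame.index.get_indexer(labels)
    missing = [lab for lab, p in zip(labels, pos) if p == -1]
    if missing:
        raise KeyError(f"labels not found in index: {missing}")
    # Positional selection keeps the index and column names intact.
    return frame.iloc[pos]

ordered = reorder_rows(df, ['z', 'x', 'y'])
print(ordered)

# Unlike searchsorted, this also works when the index is not sorted.
restored = reorder_rows(ordered, ['x', 'y', 'z'])
print(restored)
```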
