Replace values with the percentage of observations that are lower/smaller



I have a df like this:

>>> import pandas as pd
>>> a = [1, 2, 3, 4, 5, 6, 7, 8]
>>> df = pd.DataFrame({'a': a})
>>> df
   a
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8

I want to replace these values with values showing what share of the observations is less than each value (as a percentage). Like this:

>>> df
   a  how_many_percent_of_observations_are_less_than_value_from_a
0  1  0 (no observations that are lower, 0/8)
1  2  .125 (one observation is lower, 1/8)
2  3  .25 (two observations are lower, 2/8)
3  4  
4  5  
5  6  
6  7  
7  8  .875 (7 observations are lower, 7/8)

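You can use numpy broadcasting to test whether each value of a is less than every other value of the same array, then count the number of True values per column and divide by the length of the array: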

import numpy as np

a = df.a.to_numpy()
# entry [i, j] is True when a[i] < a[j]
print(a[:, None] < a)
[[False  True  True  True  True  True  True  True]
 [False False  True  True  True  True  True  True]
 [False False False  True  True  True  True  True]
 [False False False False  True  True  True  True]
 [False False False False False  True  True  True]
 [False False False False False False  True  True]
 [False False False False False False False  True]
 [False False False False False False False False]]

# summing down each column j counts the values smaller than a[j]
df['new'] = (a[:, None] < a).sum(axis=0) / len(a)
print(df)
   a    new
0  1  0.000
1  2  0.125
2  3  0.250
3  4  0.375
4  5  0.500
5  6  0.625
6  7  0.750
7  8  0.875
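As a quick check of the broadcasting approach on data with repeated values, here is a minimal sketch. The sample values and the names `df_dup`, `a_dup` and `less` are made up for illustration and are not part of the original answer. Because the comparison is a strict `<`, equal values do not count each other.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a repeated value (3 appears twice)
df_dup = pd.DataFrame({'a': [3, 1, 3, 2]})
a_dup = df_dup['a'].to_numpy()

# less[i, j] is True when a_dup[i] < a_dup[j]
less = a_dup[:, None] < a_dup

# Column sums count how many values are strictly smaller than each a_dup[j]
df_dup['new'] = less.sum(axis=0) / len(a_dup)
print(df_dup)
```

Both occurrences of 3 get 0.5 here, since two of the four values (1 and 2) are smaller.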

Use rank:

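import pandas as pd

a = [1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame({'a': a})
# method='min' gives tied values the lowest rank of their group
ranks = df['a'].rank(method='min')
maxi = ranks.size  # number of observations
df['b'] = (ranks - 1) / maxi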

Output:

>>> df
   a      b
0  1  0.000
1  2  0.125
2  3  0.250
3  4  0.375
4  5  0.500
5  6  0.625
6  7  0.750
7  8  0.875
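The choice of `method='min'` matters once the column contains ties. The sketch below uses made-up sample values and illustrative names (`s_ties`, `b_min`, `b_avg`) to compare it with the default `method='average'`.

```python
import pandas as pd

# Hypothetical sample with a tie: 3 appears twice
s_ties = pd.Series([3, 1, 3, 2])

# method='min': tied values share the lowest rank, so rank - 1 is the
# number of strictly smaller observations
b_min = (s_ties.rank(method='min') - 1) / s_ties.size

# default method='average': tied values get the mean of their ranks,
# which no longer equals the count of smaller observations
b_avg = (s_ties.rank() - 1) / s_ties.size

print(b_min.tolist())
print(b_avg.tolist())
```

With `method='min'` both 3s map to 0.5 (two of four values are smaller), while the default maps them to 0.625.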

You can use np.searchsorted with ndarray.argsort here:

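import numpy as np

a = df.a.to_numpy()
idx = a.argsort()  # indices that sort a
# insertion position in the sorted array = number of smaller values
df['new'] = np.searchsorted(a[idx], a) / len(df)
df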
   a    new
0  1  0.000
1  2  0.125
2  3  0.250
3  4  0.375
4  5  0.500
5  6  0.625
6  7  0.750
7  8  0.875
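To see why the insertion position equals the number of smaller values, here is a small sketch on made-up data with a repeated value. The names `vals`, `sorted_vals`, `n_less` and `n_less_equal` are illustrative only. It uses `np.sort(vals)`, which yields the same sorted array as `vals[vals.argsort()]`.

```python
import numpy as np

# Hypothetical sample with a repeated value
vals = np.array([3, 1, 3, 2])
sorted_vals = np.sort(vals)  # same array as vals[vals.argsort()]

# side='left' (the default) returns the first position where the value
# could be inserted, i.e. the count of strictly smaller elements
n_less = np.searchsorted(sorted_vals, vals, side='left')

# side='right' would instead count elements less than or equal to the value
n_less_equal = np.searchsorted(sorted_vals, vals, side='right')

print(n_less / len(vals))
print(n_less_equal / len(vals))
```

The default `side='left'` is therefore the one that matches the "strictly lower" definition in the question.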

Timing analysis:

Benchmark setup:

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
a = a.repeat(1_000_000)
np.random.shuffle(a)
a = a[:1_000_000]
df = pd.DataFrame({'a': a})

Results:

In [69]: %%timeit 
...: a = df.a.to_numpy() 
...: (a[:, None] < a).sum(axis=0) / len(a) 
...:  
...:  
MemoryError: Unable to allocate 931. GiB for an array with shape (1000000, 1000000) and data type bool
In [70]: %%timeit 
...: a = df.a.to_numpy() 
...: idx = a.argsort() 
...: np.searchsorted(a[idx], a) / len(df) 
...:  
...:                                                                        
96 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: %%timeit 
...: ranks = df['a'].rank() 
...: maxi = ranks.max() 
...: (ranks-1)/maxi 
...:  
...:                                                                        
86 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
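The MemoryError above follows from the size of the intermediate comparison matrix. As a rough back-of-the-envelope sketch (the variable names are illustrative), broadcasting n values against themselves builds an n × n boolean array, and NumPy stores one byte per boolean:

```python
import numpy as np

n = 1_000_000
bytes_per_bool = np.dtype(bool).itemsize  # 1 byte per element

# The broadcast comparison a[:, None] < a materialises an n x n array
bytes_needed = n * n * bytes_per_bool
gib = bytes_needed / 2**30

print(f"{gib:.0f} GiB")
```

This gives about 931 GiB, in line with the error message, so the broadcasting approach is only practical for small inputs.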

For small data, benchmark setup:

a = a[:10_000]
df = pd.DataFrame({'a': a})

Results:

In [73]: %%timeit 
...: ranks = df['a'].rank() 
...: maxi = ranks.max() 
...: (ranks-1)/maxi 
...:  
...:                                                                        
1.29 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %%timeit 
...: a = df.a.to_numpy() 
...: idx = a.argsort() 
...: np.searchsorted(a[idx], a) / len(df) 
...:  
...:                                                                        
684 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [75]: %%timeit 
...: a = df.a.to_numpy() 
...: (a[:, None] < a).sum(axis=0) / len(a) 
...:  
...:                                                                        
122 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Equality check:

ranks = df['a'].rank()
maxi = ranks.max()
ris = ((ranks-1)/maxi).to_numpy() 
jez = (a[:, None] < a).sum(axis=0) / len(a) 
idx = a.argsort()
ch3 = np.searchsorted(a[idx], a) / len(df)
(jez == ch3).all()
# True
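(jez == ris).all()
# False (ris here uses the default rank(method='average') and ranks.max(), not method='min' and ranks.size)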
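The mismatch for the rank variant appears to come from how `ris` is computed in this check: it calls `rank()` with its default `method='average'` and divides by `ranks.max()`. The sketch below repeats the comparison with `method='min'` and `ranks.size`. It uses a smaller random sample than the benchmark (the seed, sample size and the `_chk` names are assumptions made for illustration) and `np.allclose` rather than exact equality.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 1,000 draws from the values 1..8, so ties are guaranteed
rng = np.random.default_rng(0)
a_chk = rng.integers(1, 9, size=1_000)
df_chk = pd.DataFrame({'a': a_chk})

# Broadcasting and searchsorted variants
jez_chk = (a_chk[:, None] < a_chk).sum(axis=0) / len(a_chk)
ch3_chk = np.searchsorted(np.sort(a_chk), a_chk) / len(a_chk)

# rank variant with method='min' and division by the number of observations
ranks_min = df_chk['a'].rank(method='min')
ris_min = ((ranks_min - 1) / ranks_min.size).to_numpy()

# rank variant as written in the equality check (default method, ranks.max())
ranks_avg = df_chk['a'].rank()
ris_avg = ((ranks_avg - 1) / ranks_avg.max()).to_numpy()

print(np.allclose(jez_chk, ch3_chk))   # True
print(np.allclose(jez_chk, ris_min))   # True
print(np.allclose(jez_chk, ris_avg))   # False
```

With ties in the data, only the `method='min'` variant agrees with the other two approaches.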
