Python: Get array indexes of quartiles

I use the following code to calculate the quartiles of a given data set:

#!/usr/bin/python
import numpy as np
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
p1 = 25
p2 = 50
p3 = 75
q1 = np.percentile(series,  p1)
q2 = np.percentile(series,  p2)
q3 = np.percentile(series,  p3)
print('percentile(' + str(p1) + '): ' + str(q1))
print('percentile(' + str(p2) + '): ' + str(q2))
print('percentile(' + str(p3) + '): ' + str(q3))

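The percentile function returns the quartiles; however, I would also like to get the indexes that mark the quartile boundaries. Is there any way to do this?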

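Since the data is sorted, you can use numpy.searchsorted to return the indices at which the values would have to be inserted to maintain sorted order. You can specify on which "side" to insert: side='left' (the default) returns the index of the first element that is not less than the value, and side='right' returns the index just past the last element that is not greater than it.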

>>> np.searchsorted(series,q1)
1
>>> np.searchsorted(series,q1,side='right')
11
>>> np.searchsorted(series,q2)
1
>>> np.searchsorted(series,q3)
11
>>> np.searchsorted(series,q3,side='right')
13
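The left and right insertion points can also be collected in one pass. The following is a minimal sketch, assuming the input is already sorted; the helper name `quartile_bounds` is hypothetical and not part of NumPy. For sorted data, `series[left:right]` is the run of elements equal to the quartile value, and `left == right` when the percentile is an interpolated value that does not occur in the data.

```python
import numpy as np

def quartile_bounds(sorted_values, percents=(25, 50, 75)):
    """Hypothetical helper: map each percent to (value, left_index, right_index).

    Assumes sorted_values is sorted in ascending order.
    """
    arr = np.asarray(sorted_values)
    bounds = {}
    for p in percents:
        q = np.percentile(arr, p)
        # First position whose element is >= q
        left = int(np.searchsorted(arr, q, side='left'))
        # First position whose element is > q
        right = int(np.searchsorted(arr, q, side='right'))
        bounds[p] = (float(q), left, right)
    return bounds

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
bounds = quartile_bounds(series)
for p, (q, left, right) in bounds.items():
    print('percentile({}): value {}, indexes [{}, {})'.format(p, q, left, right))
```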

Assuming the data is always sorted (thanks @juanpa.arrivillaga), you can use the rank method of the pandas Series class. rank() takes several arguments. One of them is pct:

pct : boolean, default False

Computes percentage rank of data

There are different ways of calculating the percentage rank. They are controlled by the method argument:

method : {'average', 'min', 'max', 'first', 'dense'}

You need the method "max":

max: highest rank in group

Let's look at the output of the rank() method with these parameters:

import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
S = pd.Series(series)
percentage_rank = S.rank(method="max", pct=True)
print(percentage_rank)

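This basically gives you the percentile of every entry in the Series, that is, the fraction of entries that are less than or equal to it: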

0     0.0625
1     0.6875
2     0.6875
3     0.6875
4     0.6875
5     0.6875
6     0.6875
7     0.6875
8     0.6875
9     0.6875
10    0.6875
11    0.8125
12    0.8125
13    0.8750
14    0.9375
15    1.0000
dtype: float64
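As a cross-check, the same numbers can be reproduced by hand. This is only a sketch, assuming sorted input with no missing values: with method="max", the percentage rank of an entry is the count of entries less than or equal to it divided by the total count, which is what numpy.searchsorted with side='right' gives on sorted data. The names `manual` and `pandas_rank` are just illustrative.

```python
import numpy as np
import pandas as pd

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
arr = np.asarray(series)

# Count of entries <= each value, divided by the number of entries
manual = np.searchsorted(arr, arr, side='right') / len(arr)

# Percentage rank as computed by pandas
pandas_rank = pd.Series(series).rank(method="max", pct=True).to_numpy()

print(np.allclose(manual, pandas_rank))
```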

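To retrieve the indexes for the three percentiles, look for the first element in the Series whose percentage rank is equal to or higher than the percentile you are interested in. The index of that element is the index you need.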

index25 = S.index[percentage_rank >= 0.25][0]
index50 = S.index[percentage_rank >= 0.50][0]
index75 = S.index[percentage_rank >= 0.75][0]
print("25 percentile: index {}, value {}".format(index25, S[index25]))
print("50 percentile: index {}, value {}".format(index50, S[index50]))
print("75 percentile: index {}, value {}".format(index75, S[index75]))

This gives you the output:

25 percentile: index 1, value 2
50 percentile: index 1, value 2
75 percentile: index 11, value 5
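The lookup above can be wrapped in a small function so that it works for any list of percentiles. This is a sketch under the same assumption of sorted input; the function name `percentile_indexes` is hypothetical. With unsorted input, the first match in original order is not necessarily the smallest qualifying value, so sort first in that case.

```python
import pandas as pd

def percentile_indexes(values, fractions=(0.25, 0.50, 0.75)):
    """Hypothetical helper: first index whose percentage rank reaches each fraction.

    Assumes values is sorted in ascending order.
    """
    S = pd.Series(values)
    percentage_rank = S.rank(method="max", pct=True)
    # For each fraction, take the first label whose rank is >= that fraction
    return {f: int(S.index[percentage_rank >= f][0]) for f in fractions}

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
indexes = percentile_indexes(series)
for fraction, index in indexes.items():
    print("{} percentile: index {}, value {}".format(int(fraction * 100), index, series[index]))
```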

Try this:

import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
thresholds = [25,50,75]
output = pd.DataFrame([np.percentile(series,x) for x in thresholds], index = thresholds, columns = ['quartiles'])
print(output)

By making it a DataFrame, you can assign the index very easily; here the percentile thresholds label the rows.
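If the array positions are wanted in the same table, the DataFrame can carry them as extra columns. The following is a sketch that combines this idea with numpy.searchsorted, assuming sorted input; the column names `quartile`, `first_index` and `end_index` are made up for illustration.

```python
import numpy as np
import pandas as pd

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
thresholds = [25, 50, 75]

# np.percentile accepts a sequence of percentiles and returns an array
quartiles = np.percentile(series, thresholds)

output = pd.DataFrame(
    {
        'quartile': quartiles,
        # Position of the first element >= the quartile value
        'first_index': np.searchsorted(series, quartiles, side='left'),
        # Position just past the last element <= the quartile value
        'end_index': np.searchsorted(series, quartiles, side='right'),
    },
    index=thresholds,
)
print(output)
```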
