CDF x value at 50% and the mean do not show the same number

I have a DataFrame, and I created a CDF of its days column:

...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create the DataFrame from SQL (query and conn are defined elsewhere)
df = pd.read_sql_query(query, conn)
days = df['days'].dropna()

# Define the empirical CDF
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n + 1) / n
    return x, y

# Unpack x and y
x, y = ecdf(days)
sns.set()

# Plot the CDF
plt.plot(x, y, marker='.', linestyle='none')

# Overlay the quartiles
percentiles = np.array([25, 50, 75])
x_p = np.percentile(days, percentiles)
y_p = percentiles / 100.0
plt.plot(x_p, y_p, marker='D', color='red', linestyle='none')

# Get the current axes and annotate each quartile point
# (xp, yp avoid overwriting the x and y arrays above)
ax = plt.gca()
for xp, yp in zip(x_p, y_p):
    ax.annotate('%s' % xp, xy=(xp, yp), xytext=(15, 0), textcoords='offset points')

At the 50% mark, the data point in the CDF overlay shows me a mean of 120, but print(np.mean(df['days_to_engaged'])) gives me 154.

Why the difference?

print(df['days'].dropna()):

You are comparing the median with the mean. It boils down to the following:

a = np.array([1, 1, 2, 4])

At the 50% mark, the ECDF is simply the second element (1), whereas the mean is (4 + 2 + 1 + 1) / 4 == 2.
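
To check this numerically, here is a minimal sketch that reuses the ecdf function from the question on the same four-element array. The name x_at_half is introduced only for illustration. Note that np.percentile interpolates linearly by default, so the value it reports at 50% can differ slightly from the raw ECDF point, but both describe the middle of the sorted data rather than the average.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and their cumulative proportions."""
    x = np.sort(data)
    y = np.arange(1.0, len(x) + 1) / len(x)
    return x, y

a = np.array([1, 1, 2, 4])
x, y = ecdf(a)

# First sorted value whose cumulative proportion reaches 0.5
# (x_at_half is a hypothetical name used only in this sketch)
x_at_half = x[np.searchsorted(y, 0.5)]

# The 50th percentile is the median; NumPy interpolates between 1 and 2
median = np.percentile(a, 50)

# The mean is pulled upward by the large value 4
mean = np.mean(a)

print(x_at_half, median, mean)
```

On right-skewed data such as a days-until-event column, a few large values usually pull the mean above the median, which would be consistent with seeing 154 against 120.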
