How to find the most frequent value in a column using np.histogram()

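I have a DataFrame in which one column contains various numeric values. I want to find the most frequent value, specifically by using the np.histogram() function.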

I know this can be done with something like column.value_counts().nlargest(1); however, I am interested in how to achieve it with np.histogram().

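With this exercise I hope to better understand the function and the values it returns, because the description in the documentation (https://numpy.org/doc/1.18/reference/generated/numpy.histogram.html) is not very clear to me.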

Below is an example series of values for this task:

data = pd.Series(np.random.randint(1,10,size=100))

Here is one way to do it:

import numpy as np
import pandas as pd
# Make data
np.random.seed(0)
data = pd.Series(np.random.randint(1, 10, size=100))
# Make bins
bins = np.arange(data.min(), data.max() + 2)
# Compute histogram
h, _ = np.histogram(data, bins)
# Find most frequent value
mode = bins[h.argmax()]
# Mode computed with Pandas
mode_pd = data.value_counts().nlargest(1).index[0]
# Check result
print(mode == mode_pd)
# True
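
To make the return values of np.histogram more concrete, here is a small sketch on a hand-made series (the values are chosen purely for illustration and are not from the question). It shows that the function returns one count per bin plus an array of bin edges that is one element longer, which is why the value is looked up through the edges at the position given by argmax.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the question
sample = pd.Series([2, 3, 3, 5, 5, 5])

# One edge per integer from min to max, plus one closing edge: 2, 3, 4, 5, 6
edges = np.arange(sample.min(), sample.max() + 2)
counts, returned_edges = np.histogram(sample, edges)

print(counts)                             # [1 2 0 3]
print(len(counts), len(returned_edges))   # 4 5

# The left edge of the fullest bin is the most frequent value
most_frequent = edges[counts.argmax()]
print(most_frequent)                      # 5
```

Because each bin is exactly one integer wide, the left edge of a bin is the only value that can fall into it.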

You can also define bins as:

bins = np.unique(data)
bins = np.append(bins, bins[-1] + 1)

Alternatively, if your data contains only non-negative integers, you can use np.bincount directly:

mode = np.bincount(data).argmax()
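
As a small illustration of what np.bincount does (the sample values below are made up): it returns an array whose position i holds the number of occurrences of the integer i, so argmax yields the most frequent value directly. Negative entries are rejected.

```python
import numpy as np

values = np.array([1, 1, 3, 7, 7, 7])
counts = np.bincount(values)
print(counts)           # [0 2 0 1 0 0 0 3]
print(counts.argmax())  # 7

# Negative entries are not accepted
try:
    np.bincount(np.array([-1, 2, 2]))
except ValueError as exc:
    print("bincount failed:", exc)
```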

And of course there is scipy.stats.mode:

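import numpy as np
import scipy.stats
# .mode is a length-1 array in older SciPy releases and a scalar in newer ones
mode = np.atleast_1d(scipy.stats.mode(data.to_numpy()).mode)[0]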

It can be done like this:

hist, bin_edges = np.histogram(data, bins=np.arange(0.5, 10.5))
# argmax is a 0-based bin index; bin i is centred on the value i + 1
result = np.argmax(hist) + 1

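You just need to read the documentation more carefully. It says that if bins is [1, 2, 3, 4], then the first bin is [1, 2), the second is [2, 3), and the third is [3, 4] (the last bin includes its right edge).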

For your problem specifically, we count how many numbers fall into the bins [0.5, 1.5), [1.5, 2.5), ..., [8.5, 9.5], and take the index of the largest count.
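
The bin convention can be checked on a tiny input. This is only a sketch with made-up values; it shows that every bin is half-open except the last one, which is closed on both sides, and that half-integer edges give each integer a bin of its own.

```python
import numpy as np

# Integer edges: 3 and 4 both land in the last bin, which is closed on the right
counts_int, _ = np.histogram([1, 2, 3, 4], bins=[1, 2, 3, 4])
print(counts_int)   # [1 1 2]

# Half-integer edges 0.5, 1.5, ..., 4.5: one bin per integer
counts_half, _ = np.histogram([1, 2, 3, 4], bins=np.arange(0.5, 5.5))
print(counts_half)  # [1 1 1 1]
```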

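To turn the bin index into an actual value, you can use

np.unique(data)[np.argmax(hist)]

This lookup lines up with hist only when every integer covered by the bins occurs in the data, so that np.unique(data) is the full consecutive run 1, 2, ..., 9. If some values may be missing, read the value from the bin edges instead: bin_edges[np.argmax(hist)] + 0.5.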
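
A minimal sketch of mapping the fullest bin back to a value through the bin edges rather than through np.unique. The helper name most_frequent_int and its low/high parameters are hypothetical, introduced only for this illustration; it assumes integer data known to lie between low and high.

```python
import numpy as np

def most_frequent_int(data, low, high):
    """Hypothetical helper: mode of integer data known to lie in [low, high]."""
    # Edges at half-integers so each integer sits in the middle of its own bin
    edges = np.arange(low - 0.5, high + 1.5)
    hist, bin_edges = np.histogram(data, bins=edges)
    # Left edge of the fullest bin plus 0.5 is the integer at its centre
    return int(bin_edges[np.argmax(hist)] + 0.5)

# Data with gaps: only 1, 5 and 9 occur
sparse = np.array([1, 1, 5, 5, 5, 9])
print(most_frequent_int(sparse, 1, 9))  # 5
print(np.unique(sparse))                # [1 5 9]
```

With this sample the fullest bin has index 4, while np.unique(sparse) has only three entries, so indexing it with the bin index would fail. Going through the edges does not depend on which values happen to be present.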
