Making a histogram from a Spark DataFrame column

I am trying to make a histogram from one column of a DataFrame that looks like

DataFrame[C0: int, C1: int, ...]

If I want to make a histogram from column C1, what should I do?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

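Neither works, because of a data type mismatch.

The pyspark_dist_explore package mentioned by @Chris van den Berg is very good. If you would rather not add an extra dependency, you can use this code to plot a simple histogram.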

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it 
plt.hist(bins[:-1], bins=bins, weights=counts)
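
The plt.hist call above works because each left edge falls inside its own bin and carries the precomputed count as its weight. An alternative that some may find easier to read is to convert the histogram output into bar geometry directly. The following is only a sketch in plain Python: the helper name `bar_geometry` and the sample edges and counts are made up for illustration and do not come from Spark.

```python
def bar_geometry(edges, counts):
    """Turn histogram output into (left_edges, widths, heights) for a bar chart."""
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bucket edge than counts")
    lefts = list(edges[:-1])
    # Width of each bar is the distance between consecutive bucket edges.
    widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
    return lefts, widths, list(counts)


# Hypothetical histogram output with 3 buckets (not real data).
edges = [0.0, 10.0, 20.0, 30.0]
counts = [5, 2, 7]
lefts, widths, heights = bar_geometry(edges, counts)
print(lefts, widths, heights)
```

With matplotlib, the result could then be drawn with plt.bar(lefts, heights, width=widths, align="edge"), which should give the same bars as the weights trick.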

What worked for me is

# histogram() needs a bucket count or a list of bucket boundaries; 10 is an arbitrary choice
df.groupBy("C1").count().rdd.values().histogram(10)

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the pyspark.sql module.

You can use the histogram_numeric Hive UDAF:

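import random
from pyspark.sql import HiveContext  # histogram_numeric is a Hive UDAF

random.seed(323)
sqlContext = HiveContext(sc)  # sc is an existing SparkContext
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)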
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method on the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])
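
To make the output above easier to interpret: when given a number of buckets, the RDD histogram method is documented to space the buckets evenly between the minimum and maximum, with every bucket open on the right except the last, which is closed. The plain-Python sketch below mimics that behaviour for a small, non-empty list. It is not Spark's implementation; the function name `even_histogram` and the sample values are assumptions for illustration, and floating-point edge cases may be handled differently by Spark.

```python
def even_histogram(values, n_buckets):
    """Return (edges, counts) for n_buckets evenly spaced buckets between min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: every value is identical, so one bucket holds them all.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n_buckets
    edges = [lo + i * width for i in range(n_buckets)] + [hi]
    counts = [0] * n_buckets
    for v in values:
        # Buckets are half-open [left, right) except the last, which also includes hi.
        index = min(int((v - lo) / width), n_buckets - 1)
        counts[index] += 1
    return edges, counts


# Hypothetical sample data, not taken from the DataFrame above.
edges, counts = even_histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(edges)
print(counts)
```

The returned pair has the same shape as the output shown above: one more edge than there are counts.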

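Suppose the values in C1 lie between 1 and 1000 and you want a histogram with 10 bins. You can do something like:

df.withColumn("bins", ((df.C1 - 1) / 100).cast("int")).groupBy("bins").count()

The cast to an integer is what puts all values of a bin under one key; a plain df.C1/100 would leave nearly every distinct value in its own group. If your binning is more complex, you can write a UDF for it (and in the worst case you may need to analyze the column first, e.g. with describe or some other method).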
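
The fixed-width binning idea can be checked in plain Python without Spark. This is a minimal sketch under the same assumption as the answer (values from 1 to 1000, bins 100 wide); the function name `fixed_width_bins` and the sample values are hypothetical. Subtracting the lowest value first keeps 1000 in the last bin instead of opening an eleventh one.

```python
from collections import Counter


def fixed_width_bins(values, width=100, lowest=1):
    """Count values per bin; bin k covers lowest + k*width up to lowest + (k+1)*width - 1."""
    # Integer division keeps every value of a bin on the same key;
    # plain division would produce a distinct float key per value.
    return Counter((v - lowest) // width for v in values)


sample = [1, 50, 100, 101, 250, 999, 1000]  # hypothetical C1 values
counts = fixed_width_bins(sample)
for bin_index in sorted(counts):
    print(bin_index, counts[bin_index])
```

Bins with no values simply do not appear in the result, which is also what a groupBy followed by count would give.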

If you want to plot the histogram, you can use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist
fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame, you can use:

from pyspark_dist_explore import pandas_histogram
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))

A simple way is

import pandas as pd
x = df.select('symboling').toPandas()  # symboling is the column for histogram
x.plot(kind='hist')
