Using groupby in Spark



I am currently learning Spark with Python (PySpark). I have a small question: in other languages such as SQL we can simply group a table by specified columns and then perform further operations on them, such as sum, count, etc. How do we do that in Spark?

I have data with this schema:

    [name:"ABC", city:"New York", money:"50"]
    [name:"DEF", city:"London", money:"10"]
    [name:"ABC", city:"New York", money:"30"]
    [name:"XYZ", city:"London", money:"20"]
    [name:"XYZ", city:"London", money:"100"]
    [name:"DEF", city:"London", money:"200"]

Say I want to group this by city and then compute the sum for each name. Something like this:

    New York ABC 80
    London DEF 210
    London XYZ 120

You can use SQL:

>>> sc.parallelize([
... {"name": "ABC", "city": "New York", "money":"50"},
... {"name": "DEF", "city": "London",   "money":"10"},
... {"name": "ABC", "city": "New York", "money":"30"},
... {"name": "XYZ", "city": "London",   "money":"20"},
... {"name": "XYZ", "city": "London",   "money":"100"},
... {"name": "DEF", "city": "London",   "money":"200"},
... ]).toDF().registerTempTable("df")
>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total 
... FROM df GROUP BY name, city""")
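
To actually see the result, display the returned DataFrame with show. A minimal sketch (assuming the same sqlContext as above); the totals match the expected output in the question, though the row order may vary:

    result = sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
        FROM df GROUP BY name, city""")
    result.show()
    # | ABC | New York |  80 |
    # | DEF | London   | 210 |
    # | XYZ | London   | 120 |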

You can also do this the Pythonic way (or with the SQL version posted by @LostInOverflow):

grouped = df.groupby('city', 'name').sum('money')
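
If you want to control the name of the aggregated column, one possible variant (a sketch, not from the original answer, using the pyspark.sql.functions module) is:

    from pyspark.sql import functions as F

    # same aggregation, but the result column is named 'total' instead of 'sum(money)'
    grouped = df.groupBy('city', 'name').agg(F.sum('money').alias('total'))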

It looks like your money column is a string, so you will need to cast it to an int first (or load it that way, as sketched below):

df = df.withColumn('money', df['money'].cast('int'))
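
For the "load it that way" option, one possibility is to build the DataFrame with an explicit schema so that money is numeric from the start (a sketch with a hypothetical schema, reusing the sample data from the question):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # hypothetical explicit schema so 'money' is an integer column from the start
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("city", StringType(), True),
        StructField("money", IntegerType(), True),
    ])
    df = sqlContext.createDataFrame(
        [("ABC", "New York", 50), ("DEF", "London", 10), ("ABC", "New York", 30),
         ("XYZ", "London", 20), ("XYZ", "London", 100), ("DEF", "London", 200)],
        schema)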

Keep in mind that DataFrames are immutable, so both of these operations require assigning the result to an object (even if that just means assigning back to df again); then use show to see the result.
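
Put together, the pattern looks roughly like this (a sketch combining the lines above):

    df = df.withColumn('money', df['money'].cast('int'))   # reassign: DataFrames are immutable
    grouped = df.groupby('city', 'name').sum('money')
    grouped.show()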

Edit: I should add that you need to create a DataFrame first. For simple data it is almost the same as the SQL version posted, but you assign it to a DataFrame object instead of registering it as a table:

df = sc.parallelize([
    {"name": "ABC", "city": "New York", "money":"50"},
    {"name": "DEF", "city": "London",   "money":"10"},
    {"name": "ABC", "city": "New York", "money":"30"},
    {"name": "XYZ", "city": "London",   "money":"20"},
    {"name": "XYZ", "city": "London",   "money":"100"},
    {"name": "DEF", "city": "London",   "money":"200"},
    ]).toDF()
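
Because the dictionary values above are strings, the inferred schema will have money as a string column, which is why the cast shown earlier is needed. You can confirm this with printSchema (a sketch; column order may differ):

    df.printSchema()
    # root
    #  |-- city: string (nullable = true)
    #  |-- money: string (nullable = true)
    #  |-- name: string (nullable = true)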
