I'm currently learning Spark with Python. I have a small question: in SQL and other languages, we can simply group a table by specified columns and then perform further operations on them, like sum, count, and so on. How do we do that in Spark?
I have data with this schema:
[name:"ABC", city:"New York", money:"50"]
[name:"DEF", city:"London", money:"10"]
[name:"ABC", city:"New York", money:"30"]
[name:"XYZ", city:"London", money:"20"]
[name:"XYZ", city:"London", money:"100"]
[name:"DEF", city:"London", money:"200"]
Say I want to group this by city and then compute the sum for each name, like this:
New York ABC 80
London DEF 210
London XYZ 120
You can use SQL:
>>> sc.parallelize([
... {"name": "ABC", "city": "New York", "money":"50"},
... {"name": "DEF", "city": "London", "money":"10"},
... {"name": "ABC", "city": "New York", "money":"30"},
... {"name": "XYZ", "city": "London", "money":"20"},
... {"name": "XYZ", "city": "London", "money":"100"},
... {"name": "DEF", "city": "London", "money":"200"},
... ]).toDF().registerTempTable("df")
>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
... FROM df GROUP name, city""")
You can also do this the Pythonic way (or with the SQL version posted by @LostInOverflow):
grouped = df.groupby('city', 'name').sum('money')
It looks like your money column is a string, so you'll need to cast it to int first (or load it that way to begin with):
df = df.withColumn('money', df['money'].cast('int'))
Keep in mind that DataFrames are immutable, so both of these require assigning the result to an object (even if that's just back to df again), and then calling show to see the result.
Edit: I should add that you need to create a DataFrame first. For simple data it's almost the same as the posted SQL version, but you assign it to a DataFrame object instead of registering it as a table:
df = sc.parallelize([
{"name": "ABC", "city": "New York", "money":"50"},
{"name": "DEF", "city": "London", "money":"10"},
{"name": "ABC", "city": "New York", "money":"30"},
{"name": "XYZ", "city": "London", "money":"20"},
{"name": "XYZ", "city": "London", "money":"100"},
{"name": "DEF", "city": "London", "money":"200"},
]).toDF()