Suppose I have a DataFrame df in PySpark (an interface I am completely unfamiliar with) with two columns: one labeled "sports", which only takes 3 values ("soccer", "basketball", "volleyball"), and another labeled "player_names", whose entries can be any string.
How can I create a new DataFrame with 3 columns, where the first is just the "sports" column, the second is "counts", giving the total number of unique player names that play each sport, and the third is "john_counts", giving the total number of unique player names that play each sport and contain the string "john"?
For example, if I start from a DataFrame with 2 columns ['soccer', 'volleyball', 'basketball', 'basketball', 'soccer', 'soccer'], ['john doe', 'john wick', 'tom hanks', 'tom haverford', 'john doe', 'michael'],
I would like to end up with a DataFrame with 3 columns ['soccer', 'basketball', 'volleyball'], [2,2,1], [1,0,1].
In my case I actually have a huge number of sports in my first column, so solutions that avoid relying on the small size of "sports" would be appreciated.
Edit:
Input:
+----------+-------------+
| sports| player_names|
+----------+-------------+
| soccer| john doe|
|volleyball| john wick|
|basketball| tom hanks|
|basketball|tom haverford|
| soccer| john doe|
| soccer| michael|
+----------+-------------+
Expected output:
+----------+------+-----------+
| sports|counts|john_counts|
+----------+------+-----------+
| soccer| 2| 1|
|basketball| 2| 0|
|volleyball| 1| 1|
+----------+------+-----------+
Thank you very much!
Use groupby.agg, with SQL syntax:
import pyspark.sql.functions as f

df.dropDuplicates().groupby('sports').agg(
    f.expr('count(*) as counts'),
    f.expr('sum(case when player_names like "%john%" then 1 else 0 end) as john_counts')
).show()
+----------+------+-----------+
| sports|counts|john_counts|
+----------+------+-----------+
|basketball| 2| 0|
| soccer| 2| 1|
|volleyball| 1| 1|
+----------+------+-----------+
Where:
- dropDuplicates first, to guarantee unique (sports, player_names) pairs
- group by sports and count the number of unique players
- check whether player_names contains "john", and sum up the true values
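The dedup-then-aggregate steps above can be sanity-checked in plain Python, independent of Spark (a sketch of what the query computes, using the example data from the question):

```python
# Plain-Python sketch of the dropDuplicates + groupby + like-"%john%" logic.
rows = [("soccer", "john doe"), ("volleyball", "john wick"),
        ("basketball", "tom hanks"), ("basketball", "tom haverford"),
        ("soccer", "john doe"), ("soccer", "michael")]

# Step 1: drop duplicate (sports, player_names) pairs.
unique_rows = set(rows)

# Step 2: group by sport, counting all names and names containing "john".
counts, john_counts = {}, {}
for sport, name in unique_rows:
    counts[sport] = counts.get(sport, 0) + 1
    john_counts[sport] = john_counts.get(sport, 0) + ("john" in name)

# counts holds {'soccer': 2, 'basketball': 2, 'volleyball': 1}
# john_counts holds {'soccer': 1, 'basketball': 0, 'volleyball': 1}
```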
Or, if you prefer non-SQL syntax:
df.dropDuplicates().groupby('sports').agg(
    f.count('*').alias('counts'),
    f.sum(f.col('player_names').contains('john').cast('long')).alias('john_counts')
).show()
+----------+------+-----------+
| sports|counts|john_counts|
+----------+------+-----------+
|basketball| 2| 0|
| soccer| 2| 1|
|volleyball| 1| 1|
+----------+------+-----------+
You need dropDuplicates and a simple group by with count(*) and a case expression to get the result.
Code:
from pyspark.sql.types import StructType, StructField, StringType

data = [
    ("soccer", "john doe"),
    ("volleyball", "john wick"),
    ("basketball", "tom hanks"),
    ("basketball", "tom haverford"),
    ("soccer", "john doe"),
    ("soccer", "michael")
]
schema = StructType([
    StructField('game', StringType(), True),
    StructField('name', StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema).dropDuplicates()
df.createOrReplaceTempView("Data")
spark.sql("select game, count(*), sum(case when name like '%john%' then 1 else 0 end) as jo_cnt from Data group by game").show()
Step by step with the data -
>>> data = [
... ("soccer" ,"john doe"),
... ("volleyball","john wick"),
... ("basketball","tom hanks"),
... ("basketball","tom haverford"),
... ("soccer", "john doe"),
... ("soccer", "michael")
... ]
>>> schema = StructType([
... StructField('game', StringType(),True),
... StructField('name', StringType(),True)
... ])
>>> df = spark.createDataFrame(data=data, schema=schema).dropDuplicates()
>>> df.show()
+----------+-------------+
| game| name|
+----------+-------------+
|volleyball| john wick|
|basketball| tom hanks|
|basketball|tom haverford|
| soccer| john doe|
| soccer| michael|
+----------+-------------+
>>> df.createOrReplaceTempView("Data")
>>> spark.sql("select game,count(*),(sum ( case when name like '%john%' then 1 else 0 end ) ) as jo_cnt from Data group by game").show()
+----------+--------+------+
| game|count(1)|jo_cnt|
+----------+--------+------+
|basketball| 2| 0|
| soccer| 2| 1|
|volleyball| 1| 1|
+----------+--------+------+