Add new columns to a dataframe counting the number of entries in certain groups



Suppose I have a dataframe df in PySpark (an interface I'm completely unfamiliar with) that has two columns: one labeled "sports", which only takes 3 values ("soccer", "basketball", "volleyball"), and another labeled "player_names", which can take any string as its entries.

How do I create a new dataframe with 3 columns, where the first is just the "sports" column, the second is "counts", counting the total number of unique player names playing each sport, and the third is "john_counts", counting the total number of unique player names that play each sport and contain the string "john"?

For example, if I start with a dataframe with 2 columns ['soccer', 'volleyball', 'basketball', 'basketball', 'soccer', 'soccer'], ['john doe', 'john wick', 'tom hanks', 'tom haverford', 'john doe', 'michael'], I want to end up with a dataframe with 3 columns ['soccer', 'basketball', 'volleyball'], [2, 2, 1], [1, 0, 1].

In my actual case the first column contains a large number of sports, so solutions that avoid relying on the small number of "sports" values would be appreciated.


Edit: input:

+----------+-------------+
|    sports| player_names|
+----------+-------------+
|    soccer|     john doe|
|volleyball|    john wick|
|basketball|    tom hanks|
|basketball|tom haverford|
|    soccer|     john doe|
|    soccer|      michael|
+----------+-------------+

Expected output:

+----------+------+-----------+
|    sports|counts|john_counts|
+----------+------+-----------+
|    soccer|     2|          1|
|basketball|     2|          0|
|volleyball|     1|          1|
+----------+------+-----------+
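
For reference, this input dataframe can be created like so (a minimal sketch; it assumes an active SparkSession bound to the name spark, as in the pyspark shell):

# Build the example input (assumes an existing SparkSession named `spark`)
data = [
    ('soccer', 'john doe'),
    ('volleyball', 'john wick'),
    ('basketball', 'tom hanks'),
    ('basketball', 'tom haverford'),
    ('soccer', 'john doe'),
    ('soccer', 'michael'),
]
df = spark.createDataFrame(data, ['sports', 'player_names'])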

Thanks a lot!

groupby.agg, with SQL syntax:

import pyspark.sql.functions as f

# Deduplicate (sports, player_names) pairs so each player counts once per sport
df.dropDuplicates().groupby('sports').agg(
    f.expr('count(*) as counts'),
    f.expr('sum(case when player_names like "%john%" then 1 else 0 end) as john_counts')
).show()
+----------+------+-----------+
|    sports|counts|john_counts|
+----------+------+-----------+
|basketball|     2|          0|
|    soccer|     2|          1|
|volleyball|     1|          1|
+----------+------+-----------+

Where:

  1. Duplicates are dropped first to guarantee uniqueness.
  2. Grouping by sports and counting rows then gives the number of unique players per sport.
  3. The case expression checks whether player_names contains "john", and summing the 1s counts the matching names.

Or, if you prefer non-SQL syntax:

df.dropDuplicates().groupby('sports').agg(
    f.count('*').alias('counts'),
    # contains() yields a boolean; cast to long (1/0) so it can be summed
    f.sum(f.col('player_names').contains('john').cast('long')).alias('john_counts')
).show()
+----------+------+-----------+
|    sports|counts|john_counts|
+----------+------+-----------+
|basketball|     2|          0|
|    soccer|     2|          1|
|volleyball|     1|          1|
+----------+------+-----------+
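
As a sketch of a possible alternative (not part of the answer above): if you'd rather avoid the upfront dropDuplicates(), you can count distinct names directly in the aggregation. This relies on countDistinct ignoring nulls, and on when() without an otherwise() producing null for non-matching rows:

df.groupby('sports').agg(
    # number of distinct player names per sport
    f.countDistinct('player_names').alias('counts'),
    # distinct names containing "john"; non-matches become null and are not counted
    f.countDistinct(
        f.when(f.col('player_names').contains('john'), f.col('player_names'))
    ).alias('john_counts')
).show()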

You need dropDuplicates(), and then a simple group by with count(*) and a case expression gets you the result.

Code:

from pyspark.sql.types import StructType, StructField, StringType

data = [
    ("soccer", "john doe"),
    ("volleyball", "john wick"),
    ("basketball", "tom hanks"),
    ("basketball", "tom haverford"),
    ("soccer", "john doe"),
    ("soccer", "michael")
]
schema = StructType([
    StructField('game', StringType(), True),
    StructField('name', StringType(), True)
])
# dropDuplicates removes the repeated ("soccer", "john doe") row
df = spark.createDataFrame(data=data, schema=schema).dropDuplicates()
df.createOrReplaceTempView("Data")
spark.sql("select game, count(*), sum(case when name like '%john%' then 1 else 0 end) as jo_cnt from Data group by game").show()

Step by step, with data:

>>> data = [
... ("soccer" ,"john doe"),
... ("volleyball","john wick"),
... ("basketball","tom hanks"),
... ("basketball","tom haverford"),
... ("soccer", "john doe"),
... ("soccer",  "michael")
...         ]
>>> schema = StructType([
... StructField('game', StringType(),True), 
... StructField('name', StringType(),True)
... ])
>>> df = spark.createDataFrame(data=data, schema=schema).dropDuplicates()
>>> df.show()
+----------+-------------+
|      game|         name|
+----------+-------------+
|volleyball|    john wick|
|basketball|    tom hanks|
|basketball|tom haverford|
|    soccer|     john doe|
|    soccer|      michael|
+----------+-------------+
>>> df.createOrReplaceTempView("Data")
>>> spark.sql("select game,count(*),(sum ( case when name like '%john%' then 1 else 0 end ) ) as jo_cnt from Data group by game").show()
+----------+--------+------+
|      game|count(1)|jo_cnt|
+----------+--------+------+
|basketball|       2|     0|
|    soccer|       2|     1|
|volleyball|       1|     1|
+----------+--------+------+
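
If you want the same column names as the expected output, alias count(*) in the query; this is just a small variation of the SQL above:

spark.sql("""
    select game,
           count(*) as counts,
           sum(case when name like '%john%' then 1 else 0 end) as john_counts
    from Data
    group by game
""").show()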
