I have a posts DataFrame and an authors DataFrame.
I need to compute the total favCount of the authors who posted on each day.
posts_columns = ["postId", "authorId", "date"]
posts = [("1", "1", "2020-10-10"),
("2", "2", "2020-10-10"),
("3", "2", "2020-10-10"),
("4", "2", "2020-10-11")
("5", "3", "2020-10-11")
("6", "3", "2020-10-11")]
authors_columns = ["authorId", "favCount"]
authors = [
("1", "5"),
("2", "3"),
("3", "12")]
After an inner join of the Posts and Authors DataFrames on (posts.authorId = authors.authorId), I got this:
+----------+--------+---------------+
| date|authorId| favCount|
+----------+--------+---------------+
|2020-10-10| 1| 5|
|2020-10-10| 2| 3|
|2020-10-10| 2| 3|
|2020-10-11| 2| 3|
|2020-10-11| 3| 12|
|2020-10-11| 3| 12|
+----------+--------+---------------+
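A sketch of the join that would produce the table above (using the posts_df/authors_df names from the sketch earlier; the select just reorders the columns):

joined = posts_df.join(authors_df, on="authorId", how="inner").select("date", "authorId", "favCount")
joined.show()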
Now I want to compute the sum of the authors' favCount for each day. The final result should look like this:
+----------+-------------+
| date|sum(favCount)|
+----------+-------------+
|2020-10-10| 8|
|2020-10-11| 15|
+----------+-------------+
On Oct 10 I have two authors (1 and 2), for a total favCount of 8 (5 + 3).
On Oct 11 I have two authors (2 and 3), for a total favCount of 15 (3 + 12).
P.S.: I don't want to count the favCount of duplicate authors; each author's favCount should be counted only once per day.
P.S. (2): I'm using PySpark with DataFrames, but I don't mind answers in Pandas or even SQL.
Consider df1 to be the Posts DataFrame and df2 to be the Authors DataFrame.
# Inner join on the common authorId column, then keep each author only once per day
result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
# Sum favCount per day (favCount must be numeric, not a string, for this to add up)
final = result.groupby('date')['favCount'].sum()
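For completeness, a small sketch of building df1 and df2 with pandas from the sample data above (favCount is converted to numeric so the sum adds the values instead of concatenating strings):

import pandas as pd

df1 = pd.DataFrame(posts, columns=posts_columns)
df2 = pd.DataFrame(authors, columns=authors_columns)
df2['favCount'] = pd.to_numeric(df2['favCount'])  # "5" -> 5, "3" -> 3, "12" -> 12

result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
final = result.groupby('date')['favCount'].sum()
print(final)  # expected: 2020-10-10 -> 8, 2020-10-11 -> 15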
If you want to try Spark, you can try this (drop the duplicate (date, authorId) pairs first, then sum favCount per day).
Scala:
df1.join(df2, Seq("authorId"), "inner").dropDuplicates("date", "authorId").groupBy("date").sum("favCount")
or Python:
df1.join(df2, ["authorId"], "inner").dropDuplicates(["date", "authorId"]).groupBy("date").sum("favCount")