I have a posts DataFrame and an authors DataFrame.
I need to compute the total favCount of the authors who posted on each day.
posts_columns = ["postId", "authorId", "date"]
posts = [("1", "1", "2020-10-10"),
("2", "2", "2020-10-10"),
("3", "2", "2020-10-10"),
("4", "2", "2020-10-11")
("5", "3", "2020-10-11")
("6", "3", "2020-10-11")]
authors_columns = ["authorId", "favCount"]
authors = [
("1", "5"),
("2", "3"),
("3", "12")]
After an inner join of the Posts and Authors DataFrames on (posts.authorId = authors.authorId), I got this:
+----------+--------+---------------+
| date|authorId| favCount|
+----------+--------+---------------+
|2020-10-10| 1| 5|
|2020-10-10| 2| 3|
|2020-10-10| 2| 3|
|2020-10-11| 2| 3|
|2020-10-11| 3| 12|
|2020-10-11| 3| 12|
+----------+--------+---------------+
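A sketch of the join that would produce the table above (using the posts_df/authors_df names from the sketch earlier; the select just reorders the columns):

joined = posts_df.join(authors_df, on="authorId", how="inner").select("date", "authorId", "favCount")
joined.show()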
Now I want to compute the sum of the authors' favCount for each day. The final result should look like this:
+----------+-------------+
| date|sum(favCount)|
+----------+-------------+
|2020-10-10| 8|
|2020-10-11| 15|
+----------+-------------+
On Oct 10 I have two authors (1 and 2), for a total favCount of 8 (5 + 3).
On Oct 11 I have two authors (2 and 3), for a total favCount of 15 (3 + 12).
P.S.: I don't want to count the favCount of duplicate authors; each author's favCount should be counted only once per day.
P.S. (2): I'm using PySpark with DataFrames, but I don't mind answers in Pandas or even SQL.
Consider df1 to be the Posts DataFrame and df2 to be the Authors DataFrame.
# Inner join on the common authorId column, then keep each author only once per day
result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
# Sum favCount per day (favCount must be numeric, not a string, for this to add up)
final = result.groupby('date')['favCount'].sum()
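For completeness, a small sketch of building df1 and df2 with pandas from the sample data above (favCount is converted to numeric so the sum adds the values instead of concatenating strings):

import pandas as pd

df1 = pd.DataFrame(posts, columns=posts_columns)
df2 = pd.DataFrame(authors, columns=authors_columns)
df2['favCount'] = pd.to_numeric(df2['favCount'])  # "5" -> 5, "3" -> 3, "12" -> 12

result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
final = result.groupby('date')['favCount'].sum()
print(final)  # expected: 2020-10-10 -> 8, 2020-10-11 -> 15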
If you want to try Spark, you can try this (drop the duplicate (date, authorId) pairs first, then sum favCount per day).
Scala:
df1.join(df2, Seq("authorId"), "inner").dropDuplicates("date", "authorId").groupBy("date").sum("favCount")
or Python:
df1.join(df2, ["authorId"], "inner").dropDuplicates(["date", "authorId"]).groupBy("date").sum("favCount")