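I'm trying to write a query against a Twitter JSON file to extract the most influential people by looking at retweetCount. I need the output grouped by user and time zone and sorted by retweet count in descending order. When I run the query below, I keep getting this exception: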
org.apache.spark.sql.AnalysisException: cannot resolve 'total_retweets' given input columns t.retweeted_screen_name, t.tz, total_retweets, tweet_count;
sqlContext.sql("""
  SELECT
    t.retweeted_screen_name,
    t.tz,
    sum(retweets) AS total_retweets,
    count(*) AS tweet_count
  FROM (SELECT
          actor.displayName AS retweeted_screen_name,
          body,
          actor.twitterTimeZone AS tz,
          max(retweetCount) AS retweets
        FROM tweetTable WHERE body <> ''
        GROUP BY actor.displayName, actor.twitterTimeZone, body) t
  GROUP BY t.retweeted_screen_name, t.tz
  ORDER BY total_retweets DESC
  LIMIT 10 """).collect.foreach(println)
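When I try to simplify the query, I get an error like this:
Column 'total_retweets' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.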
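When you run a SQL query, the aliases defined in the SELECT list are not resolved until after the WHERE, JOIN and GROUP BY clauses have run, and here, as the AnalysisException shows, not in time for ORDER BY either (some engines do, however, resolve them before any HAVING clause). So you can't ORDER BY total_retweets; you need to ORDER BY sum(retweets).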
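
Applied to the query from the question, the change might look like the sketch below. Only the ORDER BY line differs from the original. The name query is a hypothetical helper introduced here for illustration, and the commented-out call assumes the same sqlContext and registered tweetTable as in the question. Whether your Spark version accepts the aggregate expression in ORDER BY is worth checking on your own setup.

```scala
// Same query as in the question; only the ORDER BY line changes.
// It sorts on the aggregate expression sum(retweets) instead of
// the SELECT-list alias total_retweets.
// `query` is a hypothetical name, not part of the original post.
val query = """
  SELECT
    t.retweeted_screen_name,
    t.tz,
    sum(retweets) AS total_retweets,
    count(*) AS tweet_count
  FROM (SELECT
          actor.displayName AS retweeted_screen_name,
          body,
          actor.twitterTimeZone AS tz,
          max(retweetCount) AS retweets
        FROM tweetTable WHERE body <> ''
        GROUP BY actor.displayName, actor.twitterTimeZone, body) t
  GROUP BY t.retweeted_screen_name, t.tz
  ORDER BY sum(retweets) DESC
  LIMIT 10
"""

// Assumes the sqlContext and tweetTable from the question are in scope:
// sqlContext.sql(query).collect.foreach(println)
```

The alias total_retweets still labels the output column; it is just no longer referenced inside the query itself.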