Spark SQL query: org.apache.spark.sql.AnalysisException



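I am trying to write a query over a Twitter JSON file to find the most influential people by looking at retweetCount. I need the output grouped by user and time zone, and sorted by number of retweets in descending order. When I run the query below, I keep getting this exception: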

org.apache.spark.sql.AnalysisException: cannot resolve 'total_retweets' given input columns t.retweeted_screen_name, t.tz, total_retweets, tweet_count;

sqlContext.sql("""
    SELECT
      t.retweeted_screen_name,
      t.tz,
      sum(retweets) AS total_retweets,
      count(*) AS tweet_count
    FROM (SELECT
            actor.displayName as retweeted_screen_name,
            body,
            actor.twitterTimeZone as tz,
            max(retweetCount) as retweets
          FROM tweetTable WHERE body <> ''
          GROUP BY actor.displayName, actor.twitterTimeZone,
                   body) t
    GROUP BY t.retweeted_screen_name, t.tz
    ORDER BY total_retweets DESC
    LIMIT 10 """).collect.foreach(println)

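When I try to simplify the query, I get the following error instead:

Column 'total_retweets' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.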

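When a SQL query runs, the aliases defined in the select list are not resolved until after the WHERE, JOIN, GROUP BY and ORDER BY clauses have been processed (although they are resolved before any HAVING clause). As a result, you cannot ORDER BY total_retweets; you need to ORDER BY sum(retweets) instead.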
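Applied to the query from the question, that suggestion might look like the sketch below. The only change is in the ORDER BY clause, which repeats the aggregate expression instead of using its alias. The name fixedQuery is introduced here purely for illustration, and the query has not been checked against any particular Spark version.

```scala
// Same query as in the question, except that ORDER BY repeats the
// aggregate expression rather than referring to the alias total_retweets.
// `fixedQuery` is a hypothetical name used only for this example.
val fixedQuery = """
    SELECT
      t.retweeted_screen_name,
      t.tz,
      sum(retweets) AS total_retweets,
      count(*) AS tweet_count
    FROM (SELECT
            actor.displayName AS retweeted_screen_name,
            body,
            actor.twitterTimeZone AS tz,
            max(retweetCount) AS retweets
          FROM tweetTable WHERE body <> ''
          GROUP BY actor.displayName, actor.twitterTimeZone,
                   body) t
    GROUP BY t.retweeted_screen_name, t.tz
    ORDER BY sum(retweets) DESC
    LIMIT 10 """

// Run it the same way as in the question; this needs the sqlContext
// and the registered tweetTable from there, so it is left commented out:
// sqlContext.sql(fixedQuery).collect.foreach(println)
```

The select list still exposes total_retweets as a column name in the result, so any code that reads the output rows does not have to change.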
