使用配置单元/Pyspark的列引用不明确



我需要执行以下查询:

SELECT 
a.uuid,
b.uuid
FROM
(
SELECT DISTINCT
l.uuid AS east,
r.uuid AS west
FROM 
tabl1            AS l
INNER JOIN tabl1 AS r
USING (attr_group)
WHERE l.uuid < r.uuid
)                           AS ids
INNER JOIN tabl2            AS a 
ON (ids.east = a.uuid)
INNER JOIN tabl2            AS b
ON (ids.west = b.uuid)

问题是,当我有两个a.uuid和b.uuid时,代码会在uuuid上给出ambigus错误。但是,当我只尝试使用a.uuid或b.uuid时,查询会起作用。

另一个观察结果是,查询使用hive.execute(query(运行,但不使用hive.executeQuery(query(。我使用的是Zeppelin中的pyspark解释器。

HIVE版本为:3.1.0。

最终结果集中的两列具有相同的名称。只需将它们别名以消除歧义:

SELECT 
a.uuid AS uuid_a,  --> here
b.uuid AS uuid_b   --> and here
FROM (
SELECT DISTINCT
l.uuid AS east,
r.uuid AS west
FROM tabl1 AS l
INNER JOIN tabl1 AS r USING (attr_group)
WHERE l.uuid < r.uuid
) AS ids
INNER JOIN tabl2 AS a ON ids.east = a.uuid
INNER JOIN tabl2 AS b ON ids.west = b.uuid

请注意,联接条件周围的括号是不必要的——我删除了它们。

最新更新