我写了一个查询,以查找美国3月至4月最繁忙的10个机场。它产生所需的输出,但我想尝试进一步优化它。
是否有任何特定于 HiveQL 的优化可以应用于查询?GROUPING SETS
在这里适用吗?我是 Hive 的新手,目前这是我提出的最短查询。
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
表列如下所示:
机场
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|
按机场(内连接(过滤,并在 UNION ALL 之前进行聚合,以减少传递给最终聚合化简器的数据集。具有联接的 UNION ALL 子查询应并行运行,并且比在 UNION ALL 之后使用更大的数据集联接更快。
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
调整映射联接并启用并行执行:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
使用 Tez 和矢量化,调整映射器和化简器并行性:https://stackoverflow.com/a/48487306/2700344
如果在union all
之前进行聚合可能会有所帮助:
SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
GROUP BY Origin
) UNION ALL
(SELECT Dest AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Dest
)
) f INNER JOIN
airports a
ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
我认为分组集在这里不适用,因为您只按一个字段分组。
来自 Apache Wiki:"GROUP BY 中的 GROUPING SETS 子句允许我们在同一记录集中指定多个 GROUP BY 选项。">
你可以测试这个,但你处于联盟可能更好的情况下,所以你真的需要测试它并回来:
SELECT airports.airport,
SUM(
CASE
WHEN T1.FlightsNum IS NOT NULL THEN 1
WHEN T2.FlightsNum IS NOT NULL THEN 1
ELSE 0
END
) AS Total_Flights
FROM airports
LEFT JOIN (SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t1
on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t2
on t1.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC