我有一个文件,其中的行如下:
232404812.913232|1248|ip:tcp:jxta
232404812.913238|66|ip:udp:data
232404812.913615|98|ip:udp:l2tp:ppp:ip:tcp
我执行了以下HiveQL命令:
CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 's3://b-file/input/';
CREATE EXTERNAL TABLE b_packet_out (protocol string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "t"
LOCATION 's3://b-file/output/1/';
INSERT OVERWRITE TABLE b_packet_out SELECT 'overall',
COUNT(*) FROM b_packet GROUP BY protocol;
INSERT INTO TABLE b_packet_out SELECT 'tcp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:tcp';
INSERT INTO TABLE b_packet_out SELECT 'udp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:udp';
INSERT INTO TABLE b_packet_out SELECT 'icmp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:icmp';
因此,我在输出表中有以下内容。
hive> select * from b_packet_out;
OK
udp 2241
overall 10000
icmp 64
tcp 7633
对于HiveQL查询,有没有更优雅的方法可以减少行数以获得相同的输出?
select
count(*) as overall,
sum( if(protocol like '^ip:tcp',1,0) as tcp,
sum( if(protocol like '^ip:udp',1.0) as udp,
sum( if(protocol like '^ip:icmp'1,0) as icmp
from b_packet
这将通过一次数据传递生成相同的计数。
如果你有更多的协议,你也可以说选择split(协议,":")[1],计数(*)按拆分分组(协议,":")[1]但这并不能给出总数。
这里有一个不同的解决方案,但它对数据进行多次传递,并不会真正节省代码行的数量:
SELECT CASE WHEN GROUPING__ID = 0 THEN 'overall' ELSE
CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocol LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END END AS protocol
, COUNT(1) AS cnt
FROM b_packet
GROUP BY CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocole LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END
GROUPING SETS (
(CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocol LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END)
, ()
)