Why does Hive not use MapReduce in some cases?



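I created an AWS EMR cluster, SSHed into the master node, started Hive, and created an external table on top of data in an AWS S3 bucket. But for some queries where I expect mapper or reducer jobs to run, none do. For example, for the following query I expected some mapper jobs to run, because we are narrowing the output down to two columns: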

SELECT item, store FROM tt3 LIMIT 10;

But it doesn't, and it returns results quickly. The explain command confirms this:

Stage-0   Fetch Operator
limit:10
Limit [LIM_2]
Number of rows:10
Select Operator [SEL_1]
Output:["_col0","_col1"]
TableScan [TS_0]
Output:["item","store"]

It works as expected for the query select count(*) from tt3;, which runs a MapReduce job first.

Output of EXPLAIN SELECT COUNT(*) FROM tt3;

Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
Group By Operator [GBY_4] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 1 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_3]
Group By Operator [GBY_2] (rows=1 width=8)
Output:["_col0"],aggregations:["count()"]
Select Operator [SEL_1] (rows=1 width=211312928)
TableScan [TS_0] (rows=1 width=211312928)
default@tt3,tt3,Tbl:COMPLETE,Col:COMPLETE

This is expected behavior in Hive.

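In Hive, a simple query like select * from table does not run a MapReduce job, because Hive is just dumping the data from HDFS (or, in your case, S3). The same applies to a plain column projection with LIMIT, like your query: Hive serves it with a fetch task (the Fetch Operator in your plan) instead of launching a job.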

Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id  | foo.name  | foo.age  |
+---------+-----------+----------+--+
| 1       | a         | 10       |
| 2       | a         | 10       |
| 3       | b         | 10       |
| 4       | c         | 20       |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)
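
The setting that controls this shortcut is hive.fetch.task.conversion, which accepts none, minimal, or more. As a rough sketch, assuming you are in the same Hive session and reusing the tt3 table from the question, you can inspect the setting and switch it off to see the simple query compiled into a cluster job. The related size cap, hive.fetch.task.conversion.threshold, is not changed here, and the exact plan you get back will depend on your Hive version and engine.

```sql
-- Show the current value (recent Hive releases default to `more`)
SET hive.fetch.task.conversion;

-- Turn fetch-task conversion off for this session only, so even a
-- plain projection with LIMIT is planned as a job with a map stage
SET hive.fetch.task.conversion=none;
EXPLAIN SELECT item, store FROM tt3 LIMIT 10;

-- Put the usual behavior back
SET hive.fetch.task.conversion=more;
```

With the setting at none, the plan should no longer consist of Stage-0 alone. In practice there is rarely a reason to leave it disabled, since the fetch task is what makes these small queries return quickly.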

When you do an aggregation, a reducer phase is executed along with the map phase.

Hive# select count(*) from foo group by name;
INFO  : Map 1: 0/1      Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 1/1      Reducer 2: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1
+------+--+
| _c0  |
+------+--+
| 2    |
| 1    |
| 1    |
+------+--+
3 rows selected (13.709 seconds)

We can add another reducer phase by adding an order by clause to the query above:

Hive# select count(*) cnt from foo group by name order by cnt;
INFO  : Map 1: 0/1      Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0/1  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1      Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 1/1
+------+--+
| cnt  |
+------+--+
| 1    |
| 1    |
| 2    |
+------+--+

You can see that 2 reducer phases ran, because after the aggregation we are ordering the results:

Map 1: loads the data from HDFS.
Reducer 2: performs the aggregation.
Reducer 3: after the aggregation, sorts the results in ascending order.

If you run explain on the query above:

Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage     
Reducer 2 <- Map 1 (SIMPLE_EDGE)    
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
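
A side note on terminology: plans printed as vertices and edges (Map 1, Reducer 2, SIMPLE_EDGE) typically mean Hive is running on Tez rather than classic MapReduce, although the map and reduce phases play the same roles. A minimal way to check, assuming a default EMR Hive session:

```sql
-- Print the execution engine for this session (mr, tez, or spark)
SET hive.execution.engine;
```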

See this link to get familiar with how Hive uses Map/Reduce jobs.
