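I created an AWS EMR cluster, SSHed into the master node, started Hive, and created an external table over data in an AWS S3 bucket. For some queries where I expect mapper or reducer jobs to run, Hive does not run any. For example, for the following query I expect some mapper jobs to run, because we are narrowing the output to two columns: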
SELECT item, store FROM tt3 LIMIT 10;
But it does not, and it returns the result quickly. The EXPLAIN command confirms this:
Stage-0 Fetch Operator
limit:10
Limit [LIM_2]
Number of rows:10
Select Operator [SEL_1]
Output:["_col0","_col1"]
TableScan [TS_0]
Output:["item","store"]
It works as expected for the query select count(*) from tt3; which first runs a MapReduce job.
Output of EXPLAIN SELECT COUNT(*) FROM tt3;
Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
Group By Operator [GBY_4] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 1 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_3]
Group By Operator [GBY_2] (rows=1 width=8)
Output:["_col0"],aggregations:["count()"]
Select Operator [SEL_1] (rows=1 width=211312928)
TableScan [TS_0] (rows=1 width=211312928)
default@tt3,tt3,Tbl:COMPLETE,Col:COMPLETE
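This is expected Hive behavior.
In Hive, a simple query such as select * from table does not launch a MapReduce job, because Hive only has to read the rows from storage (HDFS, or S3 in your case) and return them. Your two-column projection with LIMIT falls into the same category, which is why its plan contains only a Stage-0 Fetch Operator.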
Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id | foo.name | foo.age |
+---------+-----------+----------+--+
| 1 | a | 10 |
| 2 | a | 10 |
| 3 | b | 10 |
| 4 | c | 20 |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)
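As far as I know, this shortcut is controlled by the hive.fetch.task.conversion property, which accepts none, minimal, and more, with more being the default in recent Hive releases. If you want to see the two-column query planned as a real job, something like the sketch below should do it. It assumes the tt3 table from the question, and I have not run it against your EMR cluster, so treat the expected plan shape as an assumption rather than a guarantee.

```sql
-- Print the current value of the property for this session.
SET hive.fetch.task.conversion;

-- Disable the fetch-only shortcut for this session, so that even a
-- simple projection is planned as a cluster job.
SET hive.fetch.task.conversion=none;

-- With the shortcut disabled, the plan is expected to contain a Map
-- vertex rather than only a Stage-0 Fetch Operator.
EXPLAIN SELECT item, store FROM tt3 LIMIT 10;

-- Restore the usual setting afterwards.
SET hive.fetch.task.conversion=more;
```

For small result sets the default is usually the better choice, since starting a job costs several seconds, as the timings in the aggregation example show.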
When you perform an aggregation, a reducer phase runs in addition to the map phase.
Hive# select count(*) from foo group by name;
INFO : Map 1: 0/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 1/1 Reducer 2: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1
+------+--+
| _c0 |
+------+--+
| 2 |
| 1 |
| 1 |
+------+--+
3 rows selected (13.709 seconds)
We can add another reducer stage by adding an ORDER BY clause to the query above:
Hive# select count(*) cnt from foo group by name order by cnt;
INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1
+------+--+
| cnt |
+------+--+
| 1 |
| 1 |
| 2 |
+------+--+
You can see that two reducer stages ran, because after the aggregation we sort the results:
Map 1: loads the data from HDFS.
Reducer 2: performs the aggregation.
Reducer 3: after the aggregation, sorts the results in ascending order.
If you run EXPLAIN on the query above:
Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Refer to this link to get familiar with how Hive uses Map/Reduce jobs.