Difference between a combiner and an in-mapper combiner in MapReduce

I am new to Hadoop and MapReduce. Can someone clarify the difference between a combiner and an in-mapper combiner, or are they the same thing?

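As you probably already know, a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various Reducers in the cluster.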
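As a rough illustration of that pre-aggregation, here is a minimal plain-Java sketch with no Hadoop dependency. It assumes a word-count style job; the names `CombinerSketch`, `mapEmit`, and `combine` are hypothetical and are not part of the Hadoop API. In a real job the combiner is a Reducer subclass registered with `Job.setCombinerClass(...)`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; all names here are hypothetical, not Hadoop API.
class CombinerSketch {

    // "Map" phase: emit one (word, 1) pair per token, as a naive mapper would.
    static List<Map.Entry<String, Integer>> mapEmit(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return emitted;
    }

    // "Combine" phase: sum the already-emitted pairs per key, locally,
    // before anything would be sent over the network.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : emitted) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = mapEmit(Arrays.asList("a b a", "b a"));
        Map<String, Integer> combined = combine(emitted);
        System.out.println("pairs emitted by the mapper: " + emitted.size());
        System.out.println("pairs left after combining:  " + combined.size());
    }
}
```

Note that in this sketch the mapper still emits every single pair; the combiner only shrinks them afterwards.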

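The in-mapper combiner takes this optimization a step further: the aggregations are not even written to local disk. They happen in memory, inside the Mapper itself.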

The in-mapper combiner does this by using the setup() and cleanup() methods of org.apache.hadoop.mapreduce.Mapper to create an in-memory map, along the following lines:

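// requires: import java.util.HashMap; import java.util.Map;
Map<LongWritable, Text> inmemMap = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // created once per map task, before the first map() call
    inmemMap = new HashMap<LongWritable, Text>();
}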

Then, during each map() call, you add values to the in-memory map (instead of calling context.write() on each value). Finally, the Map/Reduce framework automatically calls:

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // do some aggregation on the inmemMap
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}

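Note that instead of calling context.write() every time, you add entries to the in-memory map. Then, in the cleanup() method, you call context.write(), but with the condensed, pre-aggregated results from the in-memory map. As a result, your local map output spool files (which the reducers will read) will be much smaller.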
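The setup() / map() / cleanup() flow described above can be sketched in plain Java, without Hadoop on the classpath. This is only an illustration under assumptions: it uses word counting as the aggregation, and `InMapperCombinerSketch`, its methods, and the `emit` callback (standing in for context.write()) are hypothetical names, not Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Plain-Java stand-in for a Mapper; all names here are hypothetical, not Hadoop API.
class InMapperCombinerSketch {
    private Map<String, Integer> inmemMap;
    private int writes = 0;

    // Counterpart of Mapper.setup(): create the in-memory map once per task.
    void setup() {
        inmemMap = new HashMap<>();
    }

    // Counterpart of map(): accumulate into the map instead of emitting.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                inmemMap.merge(word, 1, Integer::sum);
            }
        }
    }

    // Counterpart of cleanup(): emit exactly one aggregated record per key.
    void cleanup(BiConsumer<String, Integer> emit) {
        for (Map.Entry<String, Integer> entry : inmemMap.entrySet()) {
            emit.accept(entry.getKey(), entry.getValue());
            writes++;
        }
    }

    // Number of records actually emitted (the analogue of context.write() calls).
    int writes() {
        return writes;
    }

    public static void main(String[] args) {
        InMapperCombinerSketch mapper = new InMapperCombinerSketch();
        mapper.setup();
        mapper.map("a b a");
        mapper.map("b a");
        Map<String, Integer> output = new TreeMap<>();
        mapper.cleanup(output::put);
        System.out.println("records emitted: " + mapper.writes());
        System.out.println("output: " + output);
    }
}
```

Two caveats apply when carrying this over to a real Mapper. The in-memory map grows with the number of distinct keys, so a real implementation may need to flush it once it passes some size threshold. Hadoop also reuses the Writable key and value objects between map() calls, so they should be copied before being stored in the map.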

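In both cases, whether the combining happens in memory or in an external combiner, you get the benefit of less network traffic to the reducers because the map spool files are smaller. That also reduces the amount of processing the reducers have to do.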
