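I ran a Hadoop MapReduce word count job on 1 MB of data. I have a few questions to help me understand the output below:
- What are Counters?
- Why are there two map tasks? As far as I know, the number of maps is determined by the number of input splits, and the minimum input split size is 64 MB, so logically there should be only one map task!?
- What is the size of the reducers' output data?
- CPU time spent: which CPU does this refer to, given that each task tracker has its own CPU and memory?
Thanks a lot!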
[user1@li417-43 ~]$ hadoop jar wordcount1.jar wordcount1.WordCount -D mapred.reduce.tasks=10 wordin wordout10-1m
14/12/16 19:55:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/12/16 19:55:46 INFO mapred.FileInputFormat: Total input paths to process : 1
14/12/16 19:55:46 INFO mapred.JobClient: Running job: job_201405031326_0032
14/12/16 19:55:47 INFO mapred.JobClient: map 0% reduce 0%
14/12/16 19:55:59 INFO mapred.JobClient: map 100% reduce 0%
14/12/16 19:56:04 INFO mapred.JobClient: map 100% reduce 40%
14/12/16 19:56:09 INFO mapred.JobClient: map 100% reduce 80%
14/12/16 19:56:14 INFO mapred.JobClient: map 100% reduce 100%
14/12/16 19:56:15 INFO mapred.JobClient: Job complete: job_201405031326_0032
14/12/16 19:56:15 INFO mapred.JobClient: Counters: 34
14/12/16 19:56:15 INFO mapred.JobClient: File System Counters
14/12/16 19:56:15 INFO mapred.JobClient: FILE: Number of bytes read=2008100
14/12/16 19:56:15 INFO mapred.JobClient: FILE: Number of bytes written=5988058
14/12/16 19:56:15 INFO mapred.JobClient: FILE: Number of read operations=0
14/12/16 19:56:15 INFO mapred.JobClient: FILE: Number of large read operations=0
14/12/16 19:56:15 INFO mapred.JobClient: FILE: Number of write operations=0
14/12/16 19:56:15 INFO mapred.JobClient: HDFS: Number of bytes read=1005254
14/12/16 19:56:15 INFO mapred.JobClient: HDFS: Number of bytes written=140119
14/12/16 19:56:15 INFO mapred.JobClient: HDFS: Number of read operations=14
14/12/16 19:56:15 INFO mapred.JobClient: HDFS: Number of large read operations=0
14/12/16 19:56:15 INFO mapred.JobClient: HDFS: Number of write operations=20
14/12/16 19:56:15 INFO mapred.JobClient: Job Counters
14/12/16 19:56:15 INFO mapred.JobClient: Launched map tasks=2
14/12/16 19:56:15 INFO mapred.JobClient: Launched reduce tasks=10
14/12/16 19:56:15 INFO mapred.JobClient: Data-local map tasks=1
14/12/16 19:56:15 INFO mapred.JobClient: Rack-local map tasks=1
14/12/16 19:56:15 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=12953
14/12/16 19:56:15 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=49609
14/12/16 19:56:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/12/16 19:56:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/16 19:56:15 INFO mapred.JobClient: Map-Reduce Framework
14/12/16 19:56:15 INFO mapred.JobClient: Map input records=35293
14/12/16 19:56:15 INFO mapred.JobClient: Map output records=181014
14/12/16 19:56:15 INFO mapred.JobClient: Map output bytes=1646012
14/12/16 19:56:15 INFO mapred.JobClient: Input split bytes=206
14/12/16 19:56:15 INFO mapred.JobClient: Combine input records=0
14/12/16 19:56:15 INFO mapred.JobClient: Combine output records=0
14/12/16 19:56:15 INFO mapred.JobClient: Reduce input groups=14276
14/12/16 19:56:15 INFO mapred.JobClient: Reduce shuffle bytes=2008160
14/12/16 19:56:15 INFO mapred.JobClient: Reduce input records=181014
14/12/16 19:56:15 INFO mapred.JobClient: Reduce output records=14276
14/12/16 19:56:15 INFO mapred.JobClient: Spilled Records=362028
14/12/16 19:56:15 INFO mapred.JobClient: CPU time spent (ms)=26020
14/12/16 19:56:15 INFO mapred.JobClient: Physical memory (bytes) snapshot=1427562496
14/12/16 19:56:15 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8291246080
14/12/16 19:56:15 INFO mapred.JobClient: Total committed heap usage (bytes)=477896704
14/12/16 19:56:15 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/12/16 19:56:15 INFO mapred.JobClient: BYTES_READ=1002479
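- Counters: 34 is the number of counters, i.e. the number of name=value lines that follow (the group headings such as "Job Counters" are not counted).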
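To make the "34" concrete, here is a minimal sketch that pulls the name=value pairs out of JobClient log lines like the ones in the question. The function name `parse_counters` and the regular expression are my own illustration, not part of Hadoop.

```python
import re

# Matches the "name=value" part at the end of a JobClient counter line.
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(.+)=(\d+)\s*$")

def parse_counters(log_text):
    """Return {counter name: integer value} for every counter line found."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:  # headings such as "Job Counters" have no "=" and are skipped
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

# A few lines copied from the log in the question.
sample = """\
14/12/16 19:56:15 INFO mapred.JobClient: Counters: 34
14/12/16 19:56:15 INFO mapred.JobClient: Job Counters
14/12/16 19:56:15 INFO mapred.JobClient: Launched map tasks=2
14/12/16 19:56:15 INFO mapred.JobClient: Launched reduce tasks=10
14/12/16 19:56:15 INFO mapred.JobClient: Reduce output records=14276
"""

counters = parse_counters(sample)
print(counters["Launched map tasks"])  # 2
print(len(counters))                   # 3
```

Run over the full log from the question, the same function should find 34 entries, matching the "Counters: 34" line.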
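I think this is due to speculative execution (search for "speculative" at [https://developer.yahoo.com/hadoop/tutorial/module4.html]). Hadoop launches the same mapper twice to see which attempt finishes first (the other one is then killed). You can disable this by changing the mapred.map.tasks.speculative.execution configuration property in the mapred-site.xml file.
One mapper was started data-local; the second ran on the same rack but on a different node. (Data-local map tasks=1, Rack-local map tasks=1)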
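If speculative execution is indeed the cause, an entry like the following inside the `<configuration>` element of mapred-site.xml should switch it off for map tasks. This is only a sketch using the Hadoop 1.x property name quoted in this answer; newer releases use a different name, so check the defaults file for your version.

```xml
<!-- Disable speculative execution for map tasks (Hadoop 1.x property name) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```

After re-running the job with this setting, "Launched map tasks" shows whether the second map attempt was really a speculative one.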
The output of your reducers has 14276 lines (Reduce output records=14276).
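For a size in bytes rather than lines, the log also reports "HDFS: Number of bytes written". The quick calculation below assumes that this figure is essentially the reducers' output; that is my assumption, not something the log states.

```python
# Values copied from the job output in the question.
reduce_output_records = 14276   # Reduce output records
reduce_input_groups = 14276     # Reduce input groups (distinct words)
hdfs_bytes_written = 140119     # HDFS: Number of bytes written

# Word count emits one output line per distinct key, so the two counts agree.
print(reduce_output_records == reduce_input_groups)  # True

# Assumption: bytes written to HDFS is approximately the reducer output size.
output_kib = hdfs_bytes_written / 1024
avg_bytes_per_line = hdfs_bytes_written / reduce_output_records
print(round(output_kib, 1))          # 136.8
print(round(avg_bytes_per_line, 1))  # 9.8
```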
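CPU time spent (ms) is the total CPU time consumed by all tasks across all nodes, added together. It is not tied to any single CPU and is mainly useful for comparing jobs.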
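As a rough illustration of how this aggregate can be used for comparison, the sketch below relates it to the slot-time counters from the same log. The per-task average simply divides by the number of launched tasks, which is my own simplification rather than something Hadoop reports.

```python
# Values copied from the job output in the question.
cpu_time_ms = 26020       # CPU time spent (ms), summed over all tasks
map_slot_ms = 12953       # Total time spent by all maps in occupied slots (ms)
reduce_slot_ms = 49609    # Total time spent by all reduces in occupied slots (ms)
launched_tasks = 2 + 10   # Launched map tasks + Launched reduce tasks

slot_time_ms = map_slot_ms + reduce_slot_ms
avg_cpu_per_task_ms = cpu_time_ms / launched_tasks   # rough average only
cpu_share_of_slot_time = cpu_time_ms / slot_time_ms  # fraction of slot time on CPU

print(slot_time_ms)                      # 62562
print(round(avg_cpu_per_task_ms, 1))     # 2168.3
print(round(cpu_share_of_slot_time, 2))  # 0.42
```

So in this run the tasks were on a CPU for roughly 40% of the time they occupied slots. The remainder is presumably startup, I/O and shuffle waiting.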