I have set up Hadoop on four nodes. One node hosts the Namenode and the Secondary Namenode; the other three are datanodes. I ran a Sqoop job with a replication factor of 3. The Sqoop job was successful and the data ended up on all three datanodes. With 6 mappers the job took about 1.5 hours. I then ran the same job with a replication factor of 1. That job was also successful and ran in about 1 hour with 12 mappers.
My questions are:
1. When I ran the job the second time with a replication factor of 1, where is the data stored? (Is the data split and stored across all three datanodes, or is it stored only on the machine from which I ran the job?)
2. I have a 6-core processor with 64 GB of RAM on each datanode. Which properties should I set to obtain optimal values for the Sqoop job?
Here is the log of the first job:
15/06/30 00:21:28 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=749046
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=864
HDFS: Number of bytes written=253986997858
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
Job Counters
Launched map tasks=6
Other local map tasks=6
Total time spent by all maps in occupied slots (ms)=20582400
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=20582400
Total vcore-seconds taken by all map tasks=20582400
Total megabyte-seconds taken by all map tasks=73767321600
Map-Reduce Framework
Map input records=162991238
Map output records=162991238
Input split bytes=864
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=187671
CPU time spent (ms)=21216950
Physical memory (bytes) snapshot=5210345472
Virtual memory (bytes) snapshot=57549950976
Total committed heap usage (bytes)=6410469376
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=253986997858
15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 5,524.6156 seconds (43.8439 MB/sec)
15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.
Here is the log of the second job:
15/06/30 10:21:02 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1498130
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1744
HDFS: Number of bytes written=253986997858
HDFS: Number of read operations=48
HDFS: Number of large read operations=0
HDFS: Number of write operations=24
Job Counters
Launched map tasks=12
Other local map tasks=12
Total time spent by all maps in occupied slots (ms)=22551454
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=22551454
Total vcore-seconds taken by all map tasks=22551454
Total megabyte-seconds taken by all map tasks=80824411136
Map-Reduce Framework
Map input records=162991238
Map output records=162991238
Input split bytes=1744
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=186898
CPU time spent (ms)=21910100
Physical memory (bytes) snapshot=9802846208
Virtual memory (bytes) snapshot=115099107328
Total committed heap usage (bytes)=12298747904
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=253986997858
15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 3,647.7444 seconds (66.4029 MB/sec)
15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.
Here are my answers to your two questions. 1. When you run with a replication factor of 1, each HDFS block has only a single replica, but the data is still distributed across all three datanodes. HDFS places blocks across the cluster automatically, which is why the output is not confined to one machine. You can verify this yourself, as shown below.
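A minimal check, assuming a hypothetical import directory /user/hadoop/mytable (substitute your own target path):

# list every block of the imported files together with the datanode(s) holding each replica
hdfs fsck /user/hadoop/mytable -files -blocks -locations

With replication factor 1 you should see one replica per block, with the blocks themselves spread over all three datanodes.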
2. Set the number of mappers for the job according to the cores/slots available in the cluster; that will be close to optimal. You have 6-core machines here, and I assume 4 cores per node are allocated to mappers and 2 to reducers. So 4 * 3 (nodes) * 2 (two mappers can run per core) = 24 mappers would be a good number for this job (see the example command below). By default Sqoop uses only 4 mappers.
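As a sketch, assuming a MySQL source and placeholder connection string, table name, split column and target directory, an import sized for 24 mappers might look like:

# sketch only; adjust connect string, credentials, table, split column and paths to your environment
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table mytable \
  --split-by id \
  --num-mappers 24 \
  --target-dir /user/hadoop/mytable

Pick a --split-by column whose values are spread evenly, otherwise a few mappers will do most of the work. Also check that YARN can actually run that many containers per node: with 64 GB of RAM you would typically look at yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores and mapreduce.map.memory.mb so that 8 concurrent mappers fit on each datanode.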
Hope this clears up your doubts.