I am running Hadoop on a cluster of 5 machines (1 master and 4 slaves). I am running a MapReduce algorithm for friends-of-common-friends recommendation, on a file with 49,995 lines (one per person, each person followed by his friends).
The problem is that executing the algorithm on the cluster takes more time than on a single machine!
I don't know if this is normal, because the file is not big enough (so the times may be slower due to latency between the machines), or whether I have to change something so the algorithm actually runs in parallel across the different nodes — I thought that was done automatically.
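For context, the common-friends logic I mean can be sketched outside Hadoop as a tiny local simulation of the map and reduce phases. This is a minimal sketch, not my actual job code, and it assumes each input line has the form "person friend1 friend2 …":

```python
from collections import defaultdict
from itertools import combinations

# Toy input: one line per person, followed by that person's friends.
lines = [
    "A B C D",
    "B A C",
    "C A B D",
    "D A C",
]

# "Map" phase: for every person, emit each pair of their friends as a
# candidate recommendation (two people who share a friend), and record
# the direct friendships so we can filter them out later.
pairs = defaultdict(int)
direct = set()
for line in lines:
    person, *friends = line.split()
    for f in friends:
        direct.add(frozenset((person, f)))
    for f1, f2 in combinations(friends, 2):
        pairs[frozenset((f1, f2))] += 1

# "Reduce" phase: the count per pair is the number of common friends;
# drop pairs that are already direct friends.
recommendations = {
    tuple(sorted(p)): n for p, n in pairs.items() if p not in direct
}
print(recommendations)  # B and D share friends A and C, and are not friends
```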
Normally, running the algorithm on a single machine takes:
real 3m10.044s
user 2m53.766s
sys 0m4.531s
On the cluster it takes:
real 3m32.727s
user 3m10.229s
sys 0m5.545s
Here is the output when I execute the start-all.sh script on the master:
ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-namenode-ip-172-31-37-184.out
slave1: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave1.out
slave2: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave2.out
slave3: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave3.out
slave4: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave4.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-37-184.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-resourcemanager-ip-172-31-37-184.out
slave4: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave4.out
slave1: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave1.out
slave3: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave3.out
slave2: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave2.out
And here is the output when I execute the stop-all.sh script:
ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [master]
master: stopping namenode
slave4: no datanode to stop
slave3: stopping datanode
slave1: stopping datanode
slave2: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
slave2: no nodemanager to stop
slave3: no nodemanager to stop
slave4: no nodemanager to stop
slave1: no nodemanager to stop
no proxyserver to stop
Thanks in advance!
One possible cause is that your file was never uploaded to HDFS. In other words, it is stored on a single machine, and all the other machines have to fetch their data from that one. Before running your MapReduce program, you can do the following:
1- Make sure HDFS is up and running. Open the link master:50070, where master is the IP of the node running the namenode, and check on that page that all your nodes are live. So if you have 4 datanodes you should see: Datanodes (4 live).
2- Call:

hdfs dfs -put yourfile /someFolder/yourfile
This way you have uploaded your input file to HDFS, and the data is now distributed across multiple nodes.
Now try running your program and see if it is faster.
Good luck!
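Concretely, the upload in step 2 plus a couple of sanity checks look like this (the folder and file names are placeholders, and these commands must be run on a node with access to the cluster):

```shell
# Upload the input file into HDFS (create the target folder first):
hdfs dfs -mkdir -p /someFolder
hdfs dfs -put yourfile /someFolder/yourfile

# Verify that the file's blocks are actually spread across datanodes:
hdfs fsck /someFolder/yourfile -files -blocks -locations

# Quick cluster health summary (live datanodes, capacity, etc.):
hdfs dfsadmin -report
```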