Running Hadoop in fully distributed mode on a 5-machine cluster takes more time than running it on a single machine



I am running Hadoop on a cluster of 5 machines (1 master and 4 slaves). I am running a map-reduce algorithm for friends-in-common recommendation, and I am using a file with 49,995 lines (each of the 49,995 people followed by his friends).

The problem is that executing the algorithm on the cluster takes more time than executing it on a single machine!

I don't know if this is normal because the file is not big enough (so the time gets worse due to the latency between machines), or whether I have to change something so the algorithm runs in parallel on the different nodes, but I thought this was done automatically.
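One quick sanity check is to estimate how many input splits (and therefore map tasks) the input produces: a file smaller than one HDFS block runs in a single mapper, so the cluster adds network overhead without adding any map-side parallelism. The sketch below uses an assumed file size, not a measured one:

```shell
# A 49,995-line friends list is only a few megabytes; with the default
# 128 MB dfs.blocksize in Hadoop 2.x it fits in one HDFS block, so the
# job gets exactly one input split and one map task.
# The 5 MB figure is an illustrative assumption, not taken from the question.
FILE_SIZE=$((5 * 1024 * 1024))      # ~5 MB (assumed)
BLOCK_SIZE=$((128 * 1024 * 1024))   # default dfs.blocksize in Hadoop 2.x
SPLITS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "splits=$SPLITS"               # prints "splits=1"
```

With a single split, adding datanodes cannot speed up the map phase at all, which is consistent with the near-identical timings below.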

Normally, running the algorithm on a single machine takes this long:

   real 3m10.044s
   user 2m53.766s
   sys  0m4.531s

and on the cluster it takes this time:

    real    3m32.727s
    user    3m10.229s
    sys 0m5.545s

Here is the output when I execute the start-all.sh script on the master:

    ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/start-all.sh 
    This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
    Starting namenodes on [master]
    master: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-namenode-ip-172-31-37-184.out
    slave1: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave1.out
    slave2: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave2.out
    slave3: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave3.out
    slave4: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave4.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-37-184.out
    starting yarn daemons
    starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-resourcemanager-ip-172-31-37-184.out
    slave4: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave4.out
    slave1: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave1.out
    slave3: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave3.out
    slave2: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave2.out

And here is the output when I execute the stop-all.sh script:

   ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/stop-all.sh 
   This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
   Stopping namenodes on [master]
   master: stopping namenode
   slave4: no datanode to stop
   slave3: stopping datanode
   slave1: stopping datanode
   slave2: stopping datanode
   Stopping secondary namenodes [0.0.0.0]
   0.0.0.0: stopping secondarynamenode
   stopping yarn daemons
   stopping resourcemanager
   slave2: no nodemanager to stop
   slave3: no nodemanager to stop
   slave4: no nodemanager to stop
   slave1: no nodemanager to stop
   no proxyserver to stop

Thanks in advance!

One possible cause is that your file is not uploaded to HDFS. In other words, it is stored on a single machine, and all the other machines have to fetch their data from that one machine. Before running your MapReduce program, you can do the following steps:

1- Make sure that HDFS is up and running. Open the link master:50070, where master is the IP of the node running the namenode, and check on that page that all the nodes are up and running. So if you have 4 datanodes you should see: Datanodes (4 live).
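The same liveness check can also be done from the command line with standard Hadoop tools (the `grep` pattern below assumes the Hadoop 2.x report format, where each datanode entry begins with `Name:`):

```shell
# Report overall cluster health; the "Live datanodes" section
# should list all 4 slaves.
hdfs dfsadmin -report

# Quick count of datanode entries in the report; on this cluster
# it should print 4 if every datanode registered with the namenode.
hdfs dfsadmin -report | grep -c "Name:"
```

This is handy when the web UI on port 50070 is unreachable, e.g. because of firewall rules on the master.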

2- Call:

    hdfs dfs -put yourFile /someFolder/yourFile

This way you have uploaded your input file to HDFS, and the data is now distributed among multiple nodes.
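To confirm the file really landed on HDFS, and to see which datanodes hold its blocks, `hdfs fsck` can be used (the path below assumes the upload destination from step 2):

```shell
# Show the file, its block count, and the datanodes holding each replica.
# With the default replication factor of 3, each block should be listed
# on 3 of the 4 datanodes.
hdfs fsck /someFolder/yourFile -files -blocks -locations
```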

Now try running your program again and see if it is faster.

Good luck!
