I get an "Input path does not exist" error when I run the command
nutch inject crawldb urls
In nutch/logs, hadoop.log shows this error:
2015-08-16 16:08:12,834 INFO crawl.Injector - Injector: starting at 2015-08-16 16:08:12
2015-08-16 16:08:12,834 INFO crawl.Injector - Injector: crawlDb: crawldb
2015-08-16 16:08:12,835 INFO crawl.Injector - Injector: urlDir: urls
2015-08-16 16:08:12,835 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2015-08-16 16:08:13,296 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-08-16 16:08:13,417 WARN snappy.LoadSnappy - Snappy native library not loaded
2015-08-16 16:08:13,430 ERROR security.UserGroupInformation - PriviledgedActionException as:hdravi cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hdravi/urls
2015-08-16 16:08:13,432 ERROR crawl.Injector - Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hdravi/urls
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Why is it searching the local file system?
Contents of Hadoop's core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Hadoop's hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
When I run hadoop fs -ls -R /, this is the output I get:
drwxrwxrwx - hdravi supergroup 0 2015-08-16 16:06 /user
drwxrwxrwx - hdravi supergroup 0 2015-08-16 16:06 /user/hdravi
drwxr-xr-x - hdravi supergroup 0 2015-08-16 16:06 /user/hdravi/urls
-rw-r--r-- 1 hdravi supergroup 240 2015-08-16 16:06 /user/hdravi/urls/seed.txt
Am I missing any configuration in Hadoop/Nutch?
When I use the full HDFS path, I get the following error:
2015-08-16 23:33:22,876 INFO crawl.Injector - Injector: starting at 2015-08-16 23:33:22
2015-08-16 23:33:22,877 INFO crawl.Injector - Injector: crawlDb: crawldb
2015-08-16 23:33:22,877 INFO crawl.Injector - Injector: urlDir: hdfs://localhost:54310/user/hdravi/user/hdravi/urls
2015-08-16 23:33:22,878 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2015-08-16 23:33:23,317 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-08-16 23:33:23,410 WARN snappy.LoadSnappy - Snappy native library not loaded
2015-08-16 23:33:23,762 ERROR security.UserGroupInformation - PriviledgedActionException as:hdravi cause:org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
2015-08-16 23:33:23,764 ERROR crawl.Injector - Injector: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1437)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
I'm not sure about Nutch, but for Hadoop: try loading the configuration files into a Configuration object before starting the MapReduce job.
This solution worked for me:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration explicitly so fs.default.name from
// core-site.xml is used instead of the local-filesystem default.
Configuration conf = new Configuration();
conf.addResource(new Path("path to hadoop/conf/core-site.xml"));
conf.addResource(new Path("path to hadoop/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
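For intuition on why loading the configuration matters: when core-site.xml is not on the client's classpath, Hadoop's default filesystem falls back to file:///, so the relative path urls resolves against the local home directory, exactly as the first error showed. A JDK-only sketch of that resolution (the paths are copied from the logs above; this is an illustration, not Hadoop's actual code path):

```java
import java.net.URI;

public class DefaultFsResolution {
    public static void main(String[] args) {
        // Without core-site.xml loaded, the default filesystem is file:///,
        // so a relative path like "urls" resolves against the local home
        // directory -- matching the first error in the logs:
        URI localBase = URI.create("file:/home/hdravi/");
        System.out.println(localBase.resolve("urls"));  // file:/home/hdravi/urls

        // With fs.default.name = hdfs://localhost:54310 in effect, the same
        // relative path resolves inside HDFS instead:
        URI hdfsBase = URI.create("hdfs://localhost:54310/user/hdravi/");
        System.out.println(hdfsBase.resolve("urls"));   // hdfs://localhost:54310/user/hdravi/urls
    }
}
```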
You can also try using the full HDFS path of the input directory: hdfs://localhost:54310/user/hdravi
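For example, the inject command could be run with both arguments fully qualified, so the client cannot fall back to the local filesystem (the crawldb location here is an assumption, and note this still requires the Nutch and cluster Hadoop versions to be compatible, per the IPC error above):

```shell
# Hypothetical invocation with fully qualified HDFS URLs; adjust paths
# to your actual crawldb and seed directory locations.
bin/nutch inject hdfs://localhost:54310/user/hdravi/crawldb \
                 hdfs://localhost:54310/user/hdravi/urls
```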