
I'm following the Nutch tutorial and getting a "No URLs to fetch" error

I am following the Apache Nutch tutorial.

As the tutorial instructs, I set the last line of regex-urlfilter.txt to:

+^http://([a-z0-9]*.)*nutch.apache.org/

My nutch-site.xml file contains only the lines

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

My seed.txt file is:

http://nutch.apache.org/
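
One quick sanity check is to test the seed URL against the filter rule outside Nutch. The following is only a minimal Python sketch, not how Nutch itself runs its filters. It assumes each rule is a `+` or `-` sign followed by a regular expression, that the first matching rule decides, that an unmatched URL is dropped, and that Python's `re` behaves like the Java regex engine for a pattern this simple. The `accepts` helper is a hypothetical name introduced for illustration.

```python
import re

def accepts(rules, url):
    """Return True if the first rule whose regex matches `url` is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Partial match anywhere in the URL (the pattern carries its own ^ anchor).
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

# The rule and the seed URL exactly as posted in the question.
rules = ["+^http://([a-z0-9]*.)*nutch.apache.org/"]
print(accepts(rules, "http://nutch.apache.org/"))
```

If this reports that the seed is accepted, the rule itself is probably not what empties the fetch list, and it is worth checking whether Nutch is reading the edited files at all.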

However, when I run the crawl with

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

I get a "No URLs to fetch" error. Does anyone know why?

The configuration looks fine to me. Did you make these changes in the runtime/local folder? seed.txt should be in the NUTCH_HOME/runtime/local/urls folder, and regex-urlfilter.txt and nutch-site.xml should be in the NUTCH_HOME/runtime/local/conf folder,

where NUTCH_HOME is the installation directory.
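
To confirm that the three files are where the answer says they should be, something like the following Python sketch could be used. It assumes `NUTCH_HOME` is exported as an environment variable (the answer only describes it as the installation directory) and falls back to the current directory otherwise. The `missing_files` helper is hypothetical.

```python
import os
from pathlib import Path

# Relative locations named in the answer above.
EXPECTED = [
    "runtime/local/urls/seed.txt",
    "runtime/local/conf/regex-urlfilter.txt",
    "runtime/local/conf/nutch-site.xml",
]

def missing_files(nutch_home):
    """Return the expected files that do not exist under `nutch_home`."""
    root = Path(nutch_home)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

# Assumption: NUTCH_HOME is set in the environment; otherwise use the current directory.
home = os.environ.get("NUTCH_HOME", ".")
for rel in missing_files(home):
    print("missing:", rel)
```

Any path it prints is a file that was probably edited somewhere other than runtime/local.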
