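I have a simple web crawler built from crawler4j's building blocks. I'm trying to build a dictionary while the crawler runs, and then hand it back to my main class (the controller) once the text has been parsed. Since my MyCrawler object isn't created in my main class (I pass MyCrawler.class as the first argument), how can I do this? I also can't change the controller.start method. I want to be able to use the dictionary created inside the crawler after the crawl has finished.
The best approach I can think of is to have controller.start accept a predefined, already-constructed MyCrawler object, but as far as I can see there is no way to do that.
My code is below. Thanks very much for your help!
Crawler: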
import java.util.ArrayList;
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler
{
    // Skip URLs that end in a static-resource or archive extension
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public ArrayList<String> dictionary = new ArrayList<String>();

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
            && href.startsWith("http://lyle.smu.edu/~fmoore");
    }

    @Override
    public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        if(page.getParseData() instanceof HtmlParseData)
        {
            HtmlParseData h = (HtmlParseData)page.getParseData();
            String text = h.getText();
            String[] words = text.split(" ");
            for(int i = 0; i < words.length; i++)
            {
                // Keep only tokens that are neither empty nor a bare newline
                if(!words[i].equals("") && !words[i].equals("\n"))
                    dictionary.add(words[i]);
            }
            String html = h.getHtml();
            Set<WebURL> links = h.getOutgoingUrls();
            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
            System.out.println(text);
        }
    }
}
Controller:
import java.util.ArrayList;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public ArrayList<String> dictionary = new ArrayList<String>();

    public static void main(String[] args) throws Exception
    {
        int numberOfCrawlers = 1;
        String crawlStorageFolder = "/data/crawl/root";
        CrawlConfig c = new CrawlConfig();
        c.setCrawlStorageFolder(crawlStorageFolder);
        c.setMaxDepthOfCrawling(-1); //Unlimited Depth
        c.setMaxPagesToFetch(-1); //Unlimited Pages
        c.setPolitenessDelay(200); //Politeness Delay
        PageFetcher pf = new PageFetcher(c);
        RobotstxtConfig robots = new RobotstxtConfig();
        RobotstxtServer rs = new RobotstxtServer(robots, pf);
        CrawlController controller = new CrawlController(c, pf, rs);
        controller.addSeed("http://lyle.smu.edu/~fmoore");
        controller.start(MyCrawler.class, numberOfCrawlers);
        controller.shutdown();
        controller.waitUntilFinish();
    }
}
Let a WebCrawlerFactory create your MyCrawler objects. This should work (at least as of version 4.2). However, your dictionary should support concurrent access (a plain ArrayList does not!).
// use a factory, instead of supplying the crawler type, to pass the dictionary
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(dictionary);
    }
}, numberOfCrawlers);
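For the snippet above to compile, MyCrawler needs a constructor that accepts the dictionary, and the dictionary itself has to be visible from main and safe to write from several crawler threads. Below is a minimal sketch of that pattern using only the JDK, so it runs without crawler4j. The names SharedDictionaryDemo, WorkerFactory, Worker and crawl are hypothetical stand-ins (WorkerFactory plays the role of WebCrawlerFactory, Worker the role of MyCrawler), and ConcurrentLinkedQueue is just one possible thread-safe collection, not something the library requires.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SharedDictionaryDemo {

    /** Hypothetical stand-in for crawler4j's WebCrawlerFactory. */
    interface WorkerFactory<T> {
        T newInstance() throws Exception;
    }

    /** Hypothetical stand-in for MyCrawler: it writes into a collection it was handed. */
    static class Worker {
        private final Queue<String> dictionary;

        Worker(Queue<String> dictionary) {
            this.dictionary = dictionary;
        }

        // Plays the role of visit(Page): split the text and record non-empty words.
        void visit(String text) {
            for (String word : text.split("\\s+")) {
                if (!word.isEmpty()) {
                    dictionary.add(word);
                }
            }
        }
    }

    /** Runs one worker thread per page and returns the shared dictionary. */
    static Queue<String> crawl(List<String> pages) throws Exception {
        // Created by the caller, so it is still reachable after the workers finish.
        Queue<String> dictionary = new ConcurrentLinkedQueue<>();
        // Every worker built by the factory shares the same dictionary instance.
        WorkerFactory<Worker> factory = () -> new Worker(dictionary);

        Thread[] threads = new Thread[pages.size()];
        for (int i = 0; i < threads.length; i++) {
            Worker worker = factory.newInstance();
            String page = pages.get(i);
            threads[i] = new Thread(() -> worker.visit(page));
            threads[i].start();
        }
        // Wait for every worker before reading the results.
        for (Thread t : threads) {
            t.join();
        }
        return dictionary;
    }

    public static void main(String[] args) throws Exception {
        Queue<String> words = crawl(Arrays.asList("alpha beta", "  gamma\ndelta  "));
        System.out.println(words.size() + " words collected");
    }
}
```

In the real crawler the same idea would mean declaring the dictionary as a local (or static) thread-safe collection in main, passing it to each MyCrawler through the factory, and reading it once controller.start returns. The order of the words in the shared collection depends on thread scheduling, so sort or deduplicate afterwards if order matters.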