CasperJS: iterating over a list of links using casper.each



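I am trying to use CasperJS to get a list of links from a page, then open each of those links and add a particular type of data from each page to an array object.

The problem I'm having is with the loop that runs over each item in the list.

First I get a listOfLinks from the original page. This part works: checking its length confirms that the list is populated.

However, with the this.each loop shown below, none of the console statements appear, and CasperJS seems to skip the block entirely.

If I replace this.each with a standard for loop, execution only gets part of the way through the first link: the statement "Creating new array in object for x.html" appears once and then the code stops executing. Using an IIFE doesn't change this.

Edit: in verbose debug mode the following happens: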

Creating new array object for https://example.com 
[debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true

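For some reason, the URL passed to the thenOpen function becomes blank (about:blank)…

I feel there is something about the asynchronous nature of CasperJS that I'm not grasping here, and I would be grateful to be pointed to a working example.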

casper.then(function () {
  var date = Date.now();
  console.log(date);
  var object = {};
  object[date] = {}; // new object for date
  var listOfLinks = this.evaluate(function(){
    console.log("getting links");
    return document.getElementsByClassName('importantLink');
  });
  console.log(listOfLinks.length);
  this.each(listOfLinks, function(self, link) {
    var eachPageHref = link.href;
    console.log("Creating new array in object for " + eachPageHref);
    object[date][eachPageHref] = []; // array for page to store names
    self.thenOpen(eachPageHref, function () {
      var listOfItems = this.evaluate(function() {
        var items = [];
        // Perform DOM manipulation to get items
        return items;
      });
    });
    object[date][eachPageHref] = items;
  });
  console.log(JSON.stringify(object));
});

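I decided to use our own Stackoverflow.com as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise, which collects the comments from the PhantomJS bounty questions.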

var casper = require('casper').create();
casper
.start()
.open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
.then(function () {
    var date = Date.now(), object = {};
    object[date] = {};
    var listOfLinks = this.evaluate(function(){
        // Getting links to other pages to scrape, this will be 
        // a primitive array that will be easily returned from page.evaluate
        var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
          return link.href;
        });    
        return links;
    });
    // Now to iterate over that array of links
    this.each(listOfLinks, function(self, eachPageHref) {
        object[date][eachPageHref] = []; // array for page to store names
        self.thenOpen(eachPageHref, function () {
            // Getting comments from each page, also as an array
            var listOfItems = this.evaluate(function() {
                var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                    return comment.innerText;
                });    
                return items;
            });
            object[date][eachPageHref] = listOfItems;
        });
    });
    // After every link has been scraped, output the resulting object
    this.then(function(){
        console.log(JSON.stringify(object));
    });
})
casper.run();

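Changes: page.evaluate now returns simple arrays, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away inside page.evaluate. There is also this correction: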

 object[date][eachPageHref] = listOfItems; // previously assigned items which were undefined in this scope
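
The final console.log is wrapped in this.then because thenOpen only queues its callback; nothing has been scraped yet when the surrounding function returns. The sketch below is a toy step queue written for illustration (ToyRunner and its methods are hypothetical and are not CasperJS's real implementation) that shows the ordering effect with plain JavaScript.

```javascript
// Toy model of a step queue. Hypothetical helper, NOT CasperJS internals.
function ToyRunner() {
    this.queue = [];    // pending steps
    this.insertAt = 0;  // where newly queued steps are placed
}

// Queue a step. Steps queued while another step is running are placed
// ahead of the remaining outer steps, in the order they were added.
ToyRunner.prototype.then = function (step) {
    this.queue.splice(this.insertAt, 0, step);
    this.insertAt += 1;
    return this;
};

// Run steps one by one until the queue is empty.
ToyRunner.prototype.run = function () {
    while (this.queue.length > 0) {
        var step = this.queue.shift();
        this.insertAt = 0;
        step.call(this);
    }
};

var events = [];
var results = {};
var runner = new ToyRunner();

runner.then(function () {
    ["a.html", "b.html"].forEach(function (href) {
        // Stands in for thenOpen(): the callback is only queued here
        runner.then(function () {
            results[href] = ["item from " + href];
            events.push("scraped " + href);
        });
    });
    // Runs immediately, before any queued callback has fired
    events.push("inline: " + Object.keys(results).length + " pages");
    // Queued behind the scraping steps, so it sees the complete result
    runner.then(function () {
        events.push("final: " + Object.keys(results).length + " pages");
    });
});

runner.run();
console.log(events.join("\n"));
```

In this model the inline statement reports 0 pages while the step queued last reports 2, which mirrors why the output statement belongs in its own then() step.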

The result of running the script is

{"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}

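You are returning DOM nodes from the evaluate() function, which is not allowed. You can return the actual URLs instead.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Reference: PhantomJS#evaluate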
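
The JSON rule of thumb can be tried out without a browser. The sketch below approximates the evaluate() boundary with a JSON round trip; crossBoundary is a hypothetical helper, and the real bridge may differ in details, so treat it as a quick check rather than a description of PhantomJS internals.

```javascript
// Approximates what survives the page/script boundary by serializing to
// JSON and parsing the result back. Hypothetical helper for illustration.
function crossBoundary(value) {
    var json = JSON.stringify(value);
    return json === undefined ? undefined : JSON.parse(json);
}

// A plain array of strings, like the hrefs collected inside evaluate()
var hrefs = ["http://example.com/a", "http://example.com/b"];

// An object carrying a function, standing in for something richer than data
var withFunction = { href: "http://example.com/a", click: function () {} };

var survivedHrefs = crossBoundary(hrefs);
var survivedObject = crossBoundary(withFunction);
var survivedFunction = crossBoundary(function () {});

console.log(survivedHrefs);    // the strings come back intact
console.log(survivedObject);   // only the href property is left
console.log(survivedFunction); // undefined
```

Plain strings pass through unchanged while the function is lost, which is why extracting link.href inside evaluate() and returning an array of strings is the safer pattern.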

If I understand your problem correctly, the fix is to give items[] a global scope. In your code, I would do the following:

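var items = []; // declared in the outer (Casper) scope so every step can reach it
this.each(listOfLinks, function(self, link) {
    var eachPageHref = link.href;
    console.log("Creating new array in object for " + eachPageHref);
    object[date][eachPageHref] = []; // array for page to store names
    self.thenOpen(eachPageHref, function () {
        // evaluate() runs in the sandboxed page context and cannot see `items`,
        // so return the data from the page and append it here instead
        var pageItems = this.evaluate(function() {
            // Perform DOM manipulation to get items
            return [whateverThisItemIs];
        });
        items = items.concat(pageItems);
    });
});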
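
One caveat to check before relying on an outer-scope variable: the function passed to evaluate() runs in the page context, so it does not share variables with the Casper script. The sketch below imitates that separation by rebuilding a function from its source text; runInSandbox is a hypothetical stand-in, not a CasperJS or PhantomJS API.

```javascript
// Rebuilds a function from its source text and runs it in the global scope,
// roughly imitating how a function loses its closure when it is shipped to
// the page context. Hypothetical helper for illustration only.
function runInSandbox(fn) {
    var rebuilt = new Function("return (" + fn.toString() + ")();");
    return rebuilt();
}

function demo() {
    var items = []; // local to demo(), like a variable in the Casper script
    var outcome;
    try {
        runInSandbox(function () {
            items.push("x"); // `items` does not exist where this runs
        });
        outcome = "pushed";
    } catch (e) {
        outcome = e.name;
    }
    // Returning data works, because values (not scopes) cross the boundary
    var returned = runInSandbox(function () {
        return ["x"];
    });
    items = items.concat(returned);
    return { outcome: outcome, items: items };
}

var demoResult = demo();
console.log(demoResult);
```

In this imitation the direct push fails with a ReferenceError, while returning the array and concatenating it in the outer scope works.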
