Browsing HTML after scraping with Horseman



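I'm using Horseman to scrape a website so that I can build some charts from the extracted data.
I managed to get the root element of each section I care about with my code, but I don't know how to walk through the elements inside it. What I'd like to do is build JSON from some of the child elements, for example: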

  • Company name
  • Company stacks (each containing a UL list)

Here is my code so far:

router.get('/', function(req, res, next) {
  //All the web scraping magic will happen here
  var url = "http://www.welcometothejungle.co/stacks?q=&hPP=30&idx=cms_companies_stacks_production&p=";
  const pages = [0,1,2,3,4];
  pages.forEach((page) => {
    const horseman = new Horseman();
    horseman
        .open(url + '' + page)
        .html('article')
        .then((text) => {
            console.log(`${text}`);
        })
        .close();
  });
  res.render('index', {title :"Done"});
});

How can I browse the "text" result variable?

I managed to parse the data using another module called cheerio. If you have a way to do it with Horseman alone, I'd be interested to see it!

horseman
        .open(url + '' + page)
        .html('article')
        .then((htmlRes) => {
          if(htmlRes){
            var item = {}; //container for one article info
            //Loading data in cheerio to parse it
            var $ = cheerio.load(htmlRes);
            //First step : get the title
            $('h4[class=company-name]').each(function(i, elem) {
              //Strip all whitespace, then
              //delete the number of jobs
              var t = $(this).text().replace(/\s+/g, '').replace(/\d+/g, '');
              //Delete the trailing word "jobs" (4 characters)
              var tRes = t.substr(0, t.length-4);
              item.company = tRes;
            });
            var stacksF = []; //container for the list of different stacks categories
            $('div[class=company-stack-category]').each(function(i, elem) {
              var obj = {}; //one stack category
              var stacks = []; //list of items in the current stack
              obj.stackName = $(this).children('.category-title').text(); //name of the stack
              $(this).children('.company-stack-list').children('.stack-item').each(function(c, elem) {
                stacks[c] = $(this).text(); //one stack element
              });
              var stacksRes = stacks.join(', '); //join all the stack elements into a single string
              stacksRes = stacksRes.replace(/\s+/g, ''); //strip all whitespace (including the space after each comma)
              obj.stackValue = stacksRes;
              stacksF.push(obj);
            });
            item.stacks = stacksF;
            articles.push(item);
          }else{
            console.log("Impossible to retrieve data");
          }
        })
        .close();
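
If you would rather stay inside Horseman, one option is its `evaluate` action, which runs a function in the page and hands the return value back to Node, so the DOM can be walked with `querySelectorAll` instead of parsing an HTML string. The following is only a rough sketch under a few assumptions: the selectors are copied from the cheerio version above (matched as descendants rather than direct children), the helper names `extractArticles`, `cleanCompanyName`, `scrapePage` and `scrapeAll` are hypothetical, and the name clean-up assumes the heading text ends with something like "12 jobs". The page function is written in ES5 because it executes inside PhantomJS.

```javascript
// Sketch only: helper names are hypothetical, selectors come from the
// cheerio version above.

// Runs inside the PhantomJS page: ES5 only, and it may only touch `document`.
function extractArticles() {
  function toArray(list) { return Array.prototype.slice.call(list); }
  return toArray(document.querySelectorAll('article')).map(function (article) {
    var name = article.querySelector('h4.company-name');
    var categories = toArray(article.querySelectorAll('div.company-stack-category'));
    return {
      company: name ? name.textContent : '', // raw text, cleaned up later in Node
      stacks: categories.map(function (category) {
        var title = category.querySelector('.category-title');
        var items = toArray(category.querySelectorAll('.stack-item'));
        return {
          stackName: title ? title.textContent.trim() : '',
          stackValue: items.map(function (item) {
            return item.textContent.trim();
          }).join(', ')
        };
      })
    };
  });
}

// Assumption: the heading text looks like "<name> <count> jobs".
function cleanCompanyName(raw) {
  return raw.replace(/\s*\d+\s*jobs?\s*$/i, '').trim();
}

// Scrape one page and resolve with an array of { company, stacks } objects.
function scrapePage(url) {
  var Horseman = require('node-horseman'); // loaded lazily, only when scraping
  var horseman = new Horseman();
  return horseman
    .open(url)
    .evaluate(extractArticles)
    .then(function (articles) {
      return articles.map(function (a) {
        return { company: cleanCompanyName(a.company), stacks: a.stacks };
      });
    })
    .finally(function () { return horseman.close(); });
}

// Scrape several pages and resolve once all of them are done.
function scrapeAll(baseUrl, pages) {
  return Promise.all(pages.map(function (page) {
    return scrapePage(baseUrl + page);
  })).then(function (perPage) {
    return [].concat.apply([], perPage); // flatten the per-page arrays
  });
}
```

Since `scrapeAll(url, pages)` returns a promise, the route handler could call `res.render` inside its `.then(...)` so that the response is only sent once every page has been scraped.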
