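I'm using Horseman to scrape a website so that I can build some charts from the extracted data.
My code manages to get the root element of each important section, but I don't know how to walk through the elements inside it. What I want is to build JSON from some of the child elements, for example:
- the company name
- the company stack (which contains a UL list)
Here is my code so far: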
const express = require('express');
const Horseman = require('node-horseman');
const router = express.Router();

router.get('/', function(req, res, next) {
  // All the web scraping magic will happen here
  var url = "http://www.welcometothejungle.co/stacks?q=&hPP=30&idx=cms_companies_stacks_production&p=";
  const pages = [0, 1, 2, 3, 4];
  pages.forEach((page) => {
    const horseman = new Horseman();
    horseman
      .open(url + '' + page)
      .html('article')
      .then((text) => {
        console.log(`${text}`);
      })
      .close();
  });
  res.render('index', {title: "Done"});
});
How can I walk through the "text" result variable?
I managed to parse the data with another module called cheerio. If you have a way to do it with Horseman alone, that could be interesting!
// Required once at the top of the file:
const cheerio = require('cheerio');
// `articles` is assumed to be an array declared outside the loop.

horseman
  .open(url + '' + page)
  .html('article')
  .then((htmlRes) => {
    if (htmlRes) {
      var item = {}; // container for one article's info
      // Load the HTML into cheerio to parse it
      var $ = cheerio.load(htmlRes);
      // First step: get the company name
      $('h4[class=company-name]').each(function(i, elem) {
        // Strip all whitespace, then strip the number of jobs
        var t = $(this).text().replace(/\s+/g, '').replace(/\d+/g, '');
        // Drop the trailing word "jobs" (4 characters)
        var tRes = t.slice(0, -4);
        item.company = tRes;
      });
      var stacksF = []; // container for the list of stack categories
      $('div[class=company-stack-category]').each(function(i, elem) {
        var obj = {}; // one stack category
        var stacks = []; // list of items in this category
        obj.stackName = $(this).children('.category-title').text(); // name of the category
        $(this).children('.company-stack-list').children('.stack-item').each(function(c, elem) {
          stacks[c] = $(this).text(); // one stack element
        });
        var stacksRes = stacks.join(', '); // join all the stack elements into a single string
        stacksRes = stacksRes.replace(/\s+/g, ''); // strip all whitespace
        obj.stackValue = stacksRes;
        stacksF.push(obj);
      });
      item.stacks = stacksF;
      articles.push(item);
    } else {
      console.log("Impossible to retrieve data");
    }
  })
  .close();
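One thing to watch for in the clean-up steps above: stripping every whitespace character also glues multi-word names together ("Ruby on Rails" becomes "RubyonRails"), and stripping every digit would damage a company name that contains a number. Below is a minimal sketch of a gentler clean-up in plain JavaScript. It assumes the heading text looks roughly like "Acme Corp 12 jobs", which is a guess from the comments in the code and not something checked against the site. The helper names `cleanCompanyName`, `joinStackItems` and `buildItem` are hypothetical.

```javascript
// Hypothetical helpers; the input shapes are assumptions, not taken from the site.

// Collapse whitespace and strip a trailing "<number> jobs" counter.
function cleanCompanyName(raw) {
  return raw
    .replace(/\s+/g, ' ')                 // collapse runs of whitespace to one space
    .replace(/\s*\d+\s*jobs?\s*$/i, '')   // drop the trailing job counter, if present
    .trim();
}

// Trim each stack item, drop empty ones, then join them with ", ".
function joinStackItems(items) {
  return items
    .map((s) => s.replace(/\s+/g, ' ').trim())
    .filter((s) => s.length > 0)
    .join(', ');
}

// Assemble one record in the same shape as `item` in the code above.
function buildItem(rawName, categories) {
  return {
    company: cleanCompanyName(rawName),
    stacks: categories.map((c) => ({
      stackName: c.name.trim(),
      stackValue: joinStackItems(c.items),
    })),
  };
}

// Example with made-up input strings.
const example = buildItem('  Acme Corp \n 12 jobs ', [
  { name: ' Backend ', items: [' Ruby on Rails\n', ' PostgreSQL '] },
]);
console.log(JSON.stringify(example, null, 2));
```

Inside the cheerio callbacks, the raw strings would come from `$(this).text()`, so these helpers could replace the chained `replace` calls without changing the structure of the resulting JSON.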