Cheerio: get text from non-unique HTML classes (JS)



I am trying to scrape information from a website whose HTML looks like this:

<tr class="odd">
<td class="">    <table class="inline-table">
<tbody><tr>
<td rowspan="2">
<img src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" data-src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" title="Darío Osorio" alt="Darío Osorio" class="bilderrahmen-fixed lazy entered loaded" data-ll-status="loaded">            </td>
<td class="hauptlink">
<a title="Darío Osorio" href="/dario-osorio/profil/spieler/881116">Darío Osorio</a>                            </td>
</tr>
<tr>
<td>Right Winger</td>
</tr>
</tbody></table>
</td><td class="zentriert">18</td><td class="zentriert"><img src="https://tmssl.akamaized.net/images/flagge/verysmall/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"></td><td class=""><table class="inline-table">
<tbody><tr>
<td rowspan="2">
<a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037"><img src="https://tmssl.akamaized.net/images/wappen/tiny/1037.png?lm=1420190110" title="Club Universidad de Chile" alt="Club Universidad de Chile" class="tiny_wappen"></a>       </td>
<td class="hauptlink">
<a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037">U. de Chile</a>       </td>
</tr>
<tr>
<td>
<img src="https://tmssl.akamaized.net/images/flagge/tiny/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"> <a title="Primera División" href="/primera-division-de-chile/transfers/wettbewerb/CLPD">Primera División</a>        </td>
</tr>
</tbody></table>
</td><td class=""><table class="inline-table">
<tbody><tr>
<td rowspan="2">
<a title="Newcastle United" href="/newcastle-united/startseite/verein/762"><img src="https://tmssl.akamaized.net/images/wappen/tiny/762.png?lm=1472921161" title="Newcastle United" alt="Newcastle United" class="tiny_wappen"></a>     </td>
<td class="hauptlink">
<a title="Newcastle United" href="/newcastle-united/startseite/verein/762">Newcastle</a>        </td>
</tr>
<tr>
<td>
<img src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England" alt="England" class="flaggenrahmen"> <a title="Premier League" href="/premier-league/transfers/wettbewerb/GB1">Premier League</a>      </td>
</tr>
</tbody></table>
</td><td class="rechts">-</td><td class="rechts">€3.00m</td><td class="rechts hauptlink">? </td><td class="zentriert hauptlink"><a title="Darío Osorio to Newcastle United?" id="27730/Newcastle United sent scouts to Chile to follow Dario Osorio. the 18-year-old is being monitored by Barcelona, ​​Wolverhampton and Newcastle United./http://www.90min.com//16127/180/Darío Osorio to Newcastle United?" class="icons_sprite icon-pinnwand-sprechblase sprechblase-wechselwahrscheinlichkeit" href="https://www.transfermarkt.co.uk/dario-osorio-to-newcastle-united-/thread/forum/180/thread_id/16127/post_id/27730#27730">&nbsp;&nbsp;&nbsp;</a></td></tr>

I want to scrape the texts "Darío Osorio", "U. de Chile" and "Newcastle" from the different [class="hauptlink"] elements of this HTML.

I have tried a few different things; my most recent attempt looks like this:

$('.odd', html).each((index, el) => {
  const source = $(el);
  const information = source.find('td.hauptlink').first().text().trim();
  const differentInformation = source.find('a:nth-child(1)').text();
});

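But I have only managed to scrape "Darío Osorio", using the first() method. With my current code, the variable differentInformation ends up as "Darío OsorioU. de ChileNewcastle". In the end, I want the result to be a JSON object like the following: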

[
  {
    "firstInfo": "Darío Osorio",
    "secondInfo": "U. de Chile",
    "thirdInfo": "Newcastle"
  },
  {
    "firstInfo": "Information",
    "secondInfo": "Different Information",
    "thirdInfo": "More Different Information"
  }
]

After the clarification in the comments, it sounds like you're looking for something like this:

const cheerio = require("cheerio"); // 1.0.0-rc.12
const url = "YOUR URL";

(async () => {
  // Uses the global fetch available in Node 18+
  const response = await fetch(url);
  if (!response.ok) {
    throw Error(response.statusText);
  }
  const html = await response.text();
  const $ = cheerio.load(html);
  // Select every data row (rows are marked with class "odd" or "even")
  const data = [...$(".items .odd, .items .even")].map(e => {
    // Keep only the first three .hauptlink texts of each row
    const [player, currentClub, interestedClub] =
      [...$(e).find(".hauptlink")].map(e => $(e).text().trim());
    return {player, currentClub, interestedClub};
  });
  console.log(data);
})()
  .catch(error => console.log(error));

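This relies on .hauptlink, which is present on the first three cells of the row that you're interested in retrieving, so it seems like the most direct solution. A more robust solution might be to select the specific <td> cells you want.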

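I'm not sure I understood your request correctly, and I'm not very familiar with the way you're interacting with the HTML. Still, it seems to me that you could use an attribute selector to get the right elements directly. So if you want to find all elements with title="information", it would look something like this (as I said, I don't know cheerio, so I can't test it):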

$('.odd', html).each((index, el) => {
  const source = $(el);
  const allInformation = source.find('[title="information"]');
  allInformation.each((idx, information) => {
    // `information` is a raw DOM node here, so wrap it with $() before calling .text()
    console.log($(information).text().trim());
  });
});

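Edit: Now that I think about it more, you don't even need to change your query. Just don't use first(); instead, loop over your results as I did above. Your query returns an array-like collection (which is why you can call first() to get its first element), and that collection should contain all elements that match the query.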
