How do I use Jsoup to select two (or more) HTML elements that sit at the same tree level?



I am working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML:

<div class="lin-curso" style="border: 0;">
<div class="lin-area-c3">
Vagas 2017
</div>
</div>
<div class="box10">
<div class="lin-area-c1">
L160
</div>
<div class="lin-area-c2">
Acupuntura
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
&nbsp;
</div>
<div class="lin-curso-c2">
3155
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=L160&amp;code=3155" title="3155/L160">Instituto Politécnico de Setúbal - Escola Superior de Saúde</a>
</div>
<div class="lin-curso-c4">
20
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
9059
</div>
<div class="lin-area-c2">
Administração e Gestão de Empresas
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
&nbsp;
</div>
<div class="lin-curso-c2">
2270
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=9059&amp;code=2270" title="2270/9059">Universidade Católica Portuguesa - Faculdade de Ciências Económicas e Empresariais</a>
</div>
<div class="lin-curso-c4">
n.d.
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
8056
</div>
<div class="lin-area-c2">
Administração e Gestão Pública
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
&nbsp;
</div>
<div class="lin-curso-c2">
4275
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=8056&amp;code=4275" title="4275/8056">Instituto Superior de Ciências da Administração</a>
</div>
<div class="lin-curso-c4">
20
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
8194
</div>
<div class="lin-area-c2">
Administração da Guarda Nacional Republicana
</div>
<div class="lin-area-c3">
[Mest Integ]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
&nbsp;
</div>
<div class="lin-curso-c2">
7510
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=8194&amp;code=7510" title="7510/8194">Academia Militar</a>
</div>
<div class="lin-curso-c4">
n.d.
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
9672
</div>
<div class="lin-area-c2">
Administração e Marketing
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>

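Each box10 and the lin-curso that goes with it should form one element, but they don't. In some cases a box10 is followed by exactly one lin-curso, while in others a single box10 is followed by several lin-curso rows. If box10 and lin-curso were nested in one element there would be no problem. Is there a way to link the two?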

Edit: the link to the website is http://www.dges.gov.pt/guias/indcurso.asp?letra=A

The elements are inside the ".inside" element.

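This is fairly easy to solve by navigating between sibling elements. In your case, the div with class box10 plays the role of a table header, and the sibling divs with class lin-curso play the role of the table's data rows. I suggest selecting all divs with class box10 first: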

Elements boxes = doc.select("div.box10");

You can then iterate over boxes and do two main things:

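  1. Extract the data you are interested in from this div (it contains 3 child divs with the classes lin-area-c1, lin-area-c2 and lin-area-c3).
  2. Select the sibling nodes with class lin-curso and extract the data from them.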
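
Step 1 can be tried on an in-memory fragment before touching the live site. This is only a minimal sketch: the class name HeaderDemo and the helper header() are made up for illustration, and the HTML string is copied from the first box10 in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class HeaderDemo {
    // Hypothetical helper: joins the three header cells of one div.box10.
    static String header(Element box) {
        // text() collapses and trims the whitespace around each cell's content.
        String code = box.select(".lin-area-c1").text();
        String name = box.select(".lin-area-c2").text();
        String degree = box.select(".lin-area-c3").text();
        return code + ": " + name + " " + degree;
    }

    public static void main(String[] args) {
        // \u00ba is the ordinal sign from "[Lic-1º cic]", escaped to keep the source ASCII.
        String html = "<div class=\"box10\">"
                + "<div class=\"lin-area-c1\"> L160 </div>"
                + "<div class=\"lin-area-c2\"> Acupuntura </div>"
                + "<div class=\"lin-area-c3\"> [Lic-1\u00ba cic] </div>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        Element box = doc.selectFirst("div.box10");
        System.out.println(header(box));
    }
}
```

Selecting by class inside box (rather than on the whole document) keeps each lookup scoped to one header.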

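Jsoup provides a method called Element.nextElementSibling(), which returns the next sibling element of the element you call it on (or null if there is none). So when you call it on a div.box10 element, you get the div.lin-curso element that follows it.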

Here, sibling means the node that immediately follows the given node at the same tree level.
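
To see the sibling walk in isolation, here is a small sketch on a made-up fragment. The class SiblingDemo, the helper rowCodes() and the second row code 3156 are invented for the example; only nextElementSibling(), hasClass(), select() and text() come from Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

final class SiblingDemo {
    // Hypothetical helper: collects the .lin-curso-c2 codes of every
    // div.lin-curso that directly follows the given header element.
    static List<String> rowCodes(Element box) {
        List<String> codes = new ArrayList<>();
        Element next = box.nextElementSibling();
        // nextElementSibling() returns null after the last sibling,
        // so the null check has to come before hasClass().
        while (next != null && next.hasClass("lin-curso")) {
            codes.add(next.select(".lin-curso-c2").text());
            next = next.nextElementSibling();
        }
        return codes;
    }

    public static void main(String[] args) {
        // Made-up fragment: one header followed by two rows, then a <br>.
        String html = "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3156</div></div>"
                + "<br>";
        Document doc = Jsoup.parse(html);
        System.out.println(rowCodes(doc.selectFirst("div.box10")));
    }
}
```

The loop stops at the first sibling that is not a lin-curso row (here the br), which is what keeps the rows of one header from leaking into the next.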

Example solution

Below is sample code that parses the given website and prints the table to the console:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

final class TestMain {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=A").get();

        // Each div.box10 is a header; the div.lin-curso siblings after it are its rows.
        Elements boxes = doc.select("div.box10");
        for (Element box : boxes) {
            String linAreaC1 = box.select(".lin-area-c1").text();
            String linAreaC2 = box.select(".lin-area-c2").text();
            String linAreaC3 = box.select(".lin-area-c3").text();
            System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);

            // Walk the following siblings until one is not a lin-curso row.
            // nextElementSibling() returns null after the last sibling, hence the null check.
            Element linCurso = box.nextElementSibling();
            while (linCurso != null && linCurso.hasClass("lin-curso")) {
                String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                String linCursoC4 = linCurso.select(".lin-curso-c4").text();
                System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);
                linCurso = linCurso.nextElementSibling();
            }
            System.out.println("==============================");
        }
    }
}
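
If you would rather keep each header linked to its rows than print them, the same walk can fill a map. This is a sketch under assumptions: GroupDemo and group() are hypothetical names, the fragment is a shortened copy of two headers from the question, and a course code that appeared twice on the page would overwrite the earlier entry.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupDemo {
    // Hypothetical helper: maps each course code (.lin-area-c1) to the
    // institution names (.lin-curso-c3) of the rows that follow its header.
    static Map<String, List<String>> group(Document doc) {
        // LinkedHashMap keeps the headers in page order.
        Map<String, List<String>> byCourse = new LinkedHashMap<>();
        for (Element box : doc.select("div.box10")) {
            List<String> institutions = new ArrayList<>();
            Element row = box.nextElementSibling();
            while (row != null && row.hasClass("lin-curso")) {
                institutions.add(row.select(".lin-curso-c3").text());
                row = row.nextElementSibling();
            }
            byCourse.put(box.select(".lin-area-c1").text(), institutions);
        }
        return byCourse;
    }

    public static void main(String[] args) {
        // Shortened fragment: header 8194 with one row, header 9672 with none.
        String html = "<div class=\"box10\"><div class=\"lin-area-c1\">8194</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c3\">"
                + "<a href=\"#\">Academia Militar</a></div></div>"
                + "<br>"
                + "<div class=\"box10\"><div class=\"lin-area-c1\">9672</div></div>";
        System.out.println(group(Jsoup.parse(html)));
    }
}
```

A header with no rows after it simply maps to an empty list, so the caller does not need a special case for it.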

I hope this helps.
