Get all <p> and <div> text after and between <h2> tags using Jsoup


<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>

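I'm learning Jsoup and trying to scrape all the p tags on a Wikipedia page, grouped by heading. With the help of this question, I can scrape all the p tags between the h2 tags:
extract unidentified html content from between two tags, using jsoup? regex?

by using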

Elements elements = docx.select("span.mw-headline, h2 ~ p");

But I can't scrape it when there is a <div> between them. Here is the Wikipedia page I'm working on: https://simple.wikipedia.org/wiki/Battle_of_Hastings

How can I get all the p tags between two specific h2 tags? Preferably ordered by id.

Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");

Sample code:

package jsoupex;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program that prints each section headline of a page,
 * followed by the text of the div and p elements after it.
 */
public class stackoverflw {
    public static void main(String[] args) throws IOException {
        String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        System.out.printf("Fetching %s...%n", url);
        Document doc = Jsoup.connect(url).get();
        // Matches are returned in document order: headline spans, then the
        // div and p elements that follow an h2 sibling.
        Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
        for (Element elem : elements) {
            boolean isHeadline = elem.hasClass("mw-headline");
            if (isHeadline) {
                System.out.println("************************");
            }
            System.out.println(elem.text());
            // Close the banner after a headline, or leave a blank line after body text.
            System.out.println(isHeadline ? "************************" : "");
        }
    }
}
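
To keep the paragraphs grouped under the heading they belong to, one option is to key them by the id of each span.mw-headline. The sketch below is only an illustration, not part of the answer above: it parses an inline fragment instead of fetching the page, and the class name GroupByHeadline, the method groupByHeadline, and the second heading id "Aftermath" are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByHeadline {

    // Maps each headline id to the text of the paragraphs that follow it.
    // LinkedHashMap keeps the headlines in the order they appear in the document.
    static Map<String, List<String>> groupByHeadline(Document doc) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        for (Element elem : doc.select("span.mw-headline, h2 ~ p")) {
            if (elem.hasClass("mw-headline")) {
                // A new headline starts a new group.
                current = new ArrayList<>();
                sections.put(elem.id(), current);
            } else if (current != null) {
                current.add(elem.text());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Hypothetical fragment modelled on the question's sample markup.
        String html =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
            + "<div class=\"thumb tright\"></div>"
            + "<p>text I want</p>"
            + "<p>text I want</p>"
            + "<h2><span class=\"mw-headline\" id=\"Aftermath\">Aftermath</span></h2>"
            + "<p>later text</p>";
        Map<String, List<String>> sections = groupByHeadline(Jsoup.parse(html));
        for (Map.Entry<String, List<String>> entry : sections.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Because the div elements are simply left out of the selector, a thumbnail between the heading and the paragraphs does not interrupt the grouping.
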
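// Alternative: parse the fragment and collect every text node it contains.
// (TextNodeExample is a placeholder class name.)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class TextNodeExample {
    public static void main(String[] args) {
        String entity =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
            "<div class=\"thumb tright\"></div>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<h2>Second Title I want to stop collecting p tags after</h2>";
        // The XML parser keeps the fragment as-is, without adding html/head/body.
        Document element = Jsoup.parse(entity, "", Parser.xmlParser());
        element.outputSettings().prettyPrint(false);
        element.outputSettings().outline(false);
        List<TextNode> text = getAllTextNodes(element);
        for (TextNode t : text) {
            System.out.println(t.text());
        }
    }

    // Returns the text nodes owned directly by each element, in document order.
    private static List<TextNode> getAllTextNodes(Element newElementValue) {
        List<TextNode> textNodes = new ArrayList<>();
        for (Element e : newElementValue.getAllElements()) {
            textNodes.addAll(e.textNodes());
        }
        return textNodes;
    }
}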
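
Note that h2 ~ p matches every p that has any earlier h2 sibling, so it does not stop at the next heading. If only the paragraphs between one specific h2 and the following h2 are needed, walking the siblings is one way to do it. This is a minimal sketch, assuming the headings and paragraphs are siblings as in the question's fragment (pages that wrap each h2 in another element would need an adjusted walk); SectionParagraphs and paragraphsAfter are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SectionParagraphs {

    // Returns the text of each <p> that sits between the <h2> holding the
    // element with the given id and the next <h2> sibling.
    static List<String> paragraphsAfter(Document doc, String headlineId) {
        List<String> result = new ArrayList<>();
        Element heading = doc.getElementById(headlineId);
        // Climb from the span to its enclosing <h2>, if there is one.
        while (heading != null && !heading.tagName().equals("h2")) {
            heading = heading.parent();
        }
        if (heading == null) {
            return result;
        }
        // Walk forward through the siblings and stop at the next <h2>.
        for (Element sib = heading.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("p")) {
                result.add(sib.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
            + "<div class=\"thumb tright\"></div>"
            + "<p>text I want</p>"
            + "<p>text I want</p>"
            + "<h2>Second Title I want to stop collecting p tags after</h2>"
            + "<p>not collected</p>";
        Document doc = Jsoup.parse(html);
        for (String paragraph : paragraphsAfter(doc, "The_battle")) {
            System.out.println(paragraph);
        }
    }
}
```

Only direct siblings are inspected, so a p nested inside the thumbnail div is skipped; selecting within each sibling would be needed to include those.
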
