Parsing HTML with Jsoup to get the text of individual elements

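I need to parse the text below and create a separate object for each piece of text. I have tried several approaches, but none of them gives the result in the format I need.

The text is: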

String text = "This is start of a text&nbsp;<a href=\"https://google.com/sample\">followed by a link&nbsp;sample</a>and ending with some text.";

Using the following code:

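Document document = Jsoup.parse(text);
// "*" matches every element in the parsed document
Elements elements = document.select("*");
for (Element e : elements) {
    // text() returns the combined text of the element and all of its descendants
    System.out.println(e.tagName() + ": " + e.text());
}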

The actual result is:

root: This is start of a text followed by a link sampleand ending with some text.
html: This is start of a text followed by a link sampleand ending with some text.
head: 
body: This is start of a text followed by a link sampleand ending with some text.
p: This is start of a text followed by a link sampleand ending with some text.
a: followed by a link sample

I need to get the following result, so that I can create a custom object for each piece of text:

body: This is start of a text&nbsp;
a:followed by a link&nbsp;sample
body:and ending with some text.

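To avoid returning the text of all children, use e.ownText(). That is not enough in this case, though, because you want This is start of a text and and ending with some text. as separate pieces, whereas ownText() returns them joined: This is start of a text and ending with some text.
To get a list of the separate text pieces, use e.textNodes(). The output for body will then be: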

body: [
This is start of a text&nbsp;, and ending with some text.]
a: [followed by a link&nbsp;sample]

An additional advantage is that you keep the original &nbsp;.
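The difference between ownText() and textNodes() can be sketched roughly as follows. This is only an illustration: the class name OwnTextVsTextNodes and the helper bodyTextNodes are made up for the example, and it assumes the input is parsed with the default HTML parser so that the loose text ends up directly under body.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.List;

public class OwnTextVsTextNodes {
    // Same input as in the question, with the inner quotes escaped.
    public static final String TEXT = "This is start of a text&nbsp;"
            + "<a href=\"https://google.com/sample\">followed by a link&nbsp;sample</a>"
            + "and ending with some text.";

    // Hypothetical helper: returns the text nodes that are direct children of <body>.
    public static List<TextNode> bodyTextNodes(String html) {
        return Jsoup.parse(html).body().textNodes();
    }

    public static void main(String[] args) {
        Element body = Jsoup.parse(TEXT).body();

        // ownText() joins all direct text children into a single string.
        System.out.println("ownText:   " + body.ownText());

        // textNodes() keeps each run of text as its own node.
        for (TextNode node : bodyTextNodes(TEXT)) {
            // outerHtml() re-escapes the text, so the non-breaking space is written as an entity.
            System.out.println("text node: " + node.outerHtml());
        }
    }
}
```

If you need the decoded characters rather than the escaped form, TextNode.getWholeText() returns the raw text of the node.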
Also, if you don't like the superfluous html: [] and head: [] that get added to the document, you should use the XML parser:

Document document = Jsoup.parse(text, "", Parser.xmlParser());
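A small sketch of what the XML parser changes, assuming the same input string as in the question. The class name XmlParserNoWrapper and the helper parseAsXml are invented for this example. The idea is that the XML parser does not wrap the fragment in html, head and body, so the text and the link become direct children of the document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlParserNoWrapper {
    public static final String TEXT = "This is start of a text&nbsp;"
            + "<a href=\"https://google.com/sample\">followed by a link&nbsp;sample</a>"
            + "and ending with some text.";

    // Hypothetical helper: parses the markup with the XML parser and an empty base URI.
    public static Document parseAsXml(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document xml = parseAsXml(TEXT);
        Document html = Jsoup.parse(TEXT);

        // The HTML parser synthesizes a <body>; the XML parser does not.
        System.out.println("HTML parser, body elements: " + html.select("body").size());
        System.out.println("XML parser, body elements:  " + xml.select("body").size());

        // With the XML parser the document's own children are: text, <a>, text.
        System.out.println("XML parser, top-level nodes: " + xml.childNodeSize());
    }
}
```

One consequence of this choice is that the top-level text no longer has body as its parent, so a label such as body: would have to come from your own code rather than from the parsed tree.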

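To keep the text pieces separate and preserve their order relative to the <a> text, try iterating recursively: start from document.childNodes() and call childNodes() on each node. You can identify text nodes by checking if (node instanceof TextNode).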
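The recursive walk described above might look something like this. It is a sketch under a few assumptions: the Segment class stands in for the "custom object" from the question and is purely hypothetical, the walk starts at document.body() with the default HTML parser so that top-level text is labelled body, and whitespace-only text nodes are skipped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class OrderedTextSegments {

    // Hypothetical stand-in for the custom object: the parent tag plus one run of text.
    public static final class Segment {
        public final String parentTag;
        public final String text;

        Segment(String parentTag, String text) {
            this.parentTag = parentTag;
            this.text = text;
        }

        @Override
        public String toString() {
            return parentTag + ": " + text;
        }
    }

    // Collects every non-blank text node under <body>, in document order.
    public static List<Segment> collect(String html) {
        Document document = Jsoup.parse(html);
        List<Segment> segments = new ArrayList<>();
        walk(document.body(), segments);
        return segments;
    }

    // Depth-first traversal: text nodes become segments, other nodes are descended into.
    private static void walk(Node parent, List<Segment> out) {
        for (Node child : parent.childNodes()) {
            if (child instanceof TextNode) {
                TextNode textNode = (TextNode) child;
                if (!textNode.isBlank()) {
                    // getWholeText() returns the decoded text (a real non-breaking space, not the entity).
                    out.add(new Segment(parent.nodeName(), textNode.getWholeText()));
                }
            } else {
                walk(child, out);
            }
        }
    }

    public static void main(String[] args) {
        String text = "This is start of a text&nbsp;"
                + "<a href=\"https://google.com/sample\">followed by a link&nbsp;sample</a>"
                + "and ending with some text.";
        for (Segment segment : collect(text)) {
            System.out.println(segment);
        }
    }
}
```

Because the traversal visits children in order, the body text before the link, the link text, and the body text after the link come out as three separate segments in their original sequence. If the escaped form with &nbsp; is wanted instead, textNode.outerHtml() could be stored in place of getWholeText().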
