Tokenizing the text content of an XML element using DOM in Java



I have an XML file that contains tags such as the following:

<P>(b) <E T="03">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>

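I need to parse the text content and return the result as a string array: ["(b)", "Filing of financial reports.", "(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,"].

In other words, I need to tokenize the text content of the <P> element around the <E T="03"> tag and store the result in a string array.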

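There is nothing to "tokenize", because the parsing has already been done for you when the DOM was built. The <P> node contains text nodes and child element nodes. This is what the DOM looks like: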

P
|
+---text "(b) "
|
+---E
|   |
|   +---attribute T=03
|   |
|   +---text "Filing of financial reports."
|
+---text " (1)(i) Except as provided ..."

To get the desired result, you need to walk the child nodes of <P> and extract the text of each one.
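As a minimal sketch of that walk using only the standard DOM API, the following collects the trimmed text of each direct child of the root element into a string array. The class name PTextTokens and the helper tokenize are hypothetical names chosen for illustration, and the sketch assumes the <P> element is the document root and that empty or whitespace-only pieces should be dropped.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PTextTokens {
    // Hypothetical helper: returns the trimmed text of each direct child
    // (text node or element) of the document's root element.
    public static String[] tokenize(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.ELEMENT_NODE) {
                // getTextContent() on an element concatenates its descendant text.
                String text = child.getTextContent().trim();
                if (!text.isEmpty()) {
                    tokens.add(text);
                }
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as "
                + "provided in paragraphs (b)(3) and (h) of this section,</P>";
        for (String token : tokenize(xml)) {
            System.out.println(token);
        }
    }
}
```

For the sample element this should yield three entries: the leading "(b)", the text of <E>, and the trailing sentence. In a real file you would likely pass each <P> element to a similar loop instead of the document root.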

Here is one way to do this using the jsoup library:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
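class Test {
  public static void main(String args[]) throws Exception {
    // Inner double quotes must be escaped inside a Java string literal.
    String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
    // Jsoup.parse treats the input as HTML, so tag names are matched case-insensitively.
    Document doc = Jsoup.parse(xml);
    for (Element e : doc.select("p")) {
      for (Node child : e.childNodes()) {
        if (child instanceof TextNode) {
          // Plain text directly inside <P>
          System.out.println(((TextNode) child).text());
        } else if (child instanceof Element) {
          // Text inside a child element such as <E>
          System.out.println(((Element) child).text());
        }
      }
    }
  }
}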

Output:

(b) 
Filing of financial reports.
 (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,

Use XPath. If you don't want to use a specialized Java library, you can use just the standard Java API, for example:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class ExtractingAllTextNodes {
    // Inner double quotes must be escaped inside a Java string literal.
    private static final String XML = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
    public static void main(final String[] args) throws Exception {
        final XPath xPath = XPathFactory.newInstance().newXPath();
        final DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        final DocumentBuilder builder = builderFactory.newDocumentBuilder();
        // Selects every text node in the document, at any depth.
        final String expression = "//text()";
        final Document xmlDocument = builder.parse(new ByteArrayInputStream(XML.getBytes()));
        final NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
        for (int i = 0; i < nodeList.getLength(); i++) {
            System.out.println("=> " + nodeList.item(i).getTextContent());
        }
    }
}

Output:

=> (b) 
=> Filing of financial reports.
=>  (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,

You can change the XPath expression to suit your needs.
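To illustrate how the expression changes the result, here is a small sketch that evaluates an arbitrary XPath expression against an XML string and returns the trimmed text of each matching node. The class XPathVariants and the helper select are hypothetical names used only for this example; the two expressions shown, /P/text() and //E/text(), assume the sample element from the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathVariants {
    // Hypothetical helper: evaluates the expression and returns the trimmed
    // text content of every node in the resulting node set.
    public static String[] select(String xml, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        String[] result = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            result[i] = nodes.item(i).getTextContent().trim();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as "
                + "provided in paragraphs (b)(3) and (h) of this section,</P>";
        // Only the text nodes that are direct children of <P> (skips the text inside <E>).
        for (String s : select(xml, "/P/text()")) {
            System.out.println("P text: " + s);
        }
        // Only the text inside <E> elements.
        for (String s : select(xml, "//E/text()")) {
            System.out.println("E text: " + s);
        }
    }
}
```

With the sample input, /P/text() should match the two text nodes around <E>, while //E/text() should match only the text inside it.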

OK. I finally managed to find a solution to my problem. The code is a bit complicated, but it uses DOM, which is the standard API for XML parsing:

// Requires: import org.w3c.dom.Element; import org.w3c.dom.Node; import org.w3c.dom.NodeList;
public static void parseSection(Element sec){
    // All <P> elements under the first <contents> element of the section
    NodeList pTags = ((Element) sec
            .getElementsByTagName("contents").item(0))
            .getElementsByTagName("P");
    int pTagIndex = 0;
    while (pTagIndex < pTags.getLength()) {
        System.out.println(pTagIndex);
        Node pTag = pTags.item(pTagIndex);
        NodeList pTagChildren = pTag.getChildNodes();
        int pTagChildrenIndex = 0;
        while(pTagChildrenIndex < pTagChildren.getLength()){
            Node pTagChild = pTagChildren.item(pTagChildrenIndex);
            if(pTagChild.getNodeName().equals("#text")){
                // Text directly inside <P>
                System.out.println("Text: " + pTagChild.getNodeValue());
            } else if(pTagChild.getNodeName().equals("E")){
                // Text inside an <E> child element
                System.out.println("E: " + pTagChild.getTextContent());
            }
            pTagChildrenIndex ++;
        }
        // Advance to the next <P>; without this the outer loop never ends
        pTagIndex ++;
    }
}
