I have an XML file that contains tags like this:
<P>(b) <E T="03">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>
I need to parse the text content and return the result as an array of strings: ["(b)", "Filing of financial reports.", "(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,"].
In other words, I need to tokenize the text content of the <P> element on the <E T="03"> tag and store the result in a string array.
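There is nothing to "tokenize": the parsing was already done for you when the DOM was built. The <P> node contains text nodes and child elements. This is what the DOM looks like: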
P
|
+---text "(b) "
|
+---E
|   |
|   +---attribute T=03
|   |
|   +---text "Filing of financial reports."
|
+---text " (1)(i) Except as provided ..."
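To check that structure against a real parse, here is a minimal sketch that lists the direct children of the root element using only the standard DOM API. The class name `DomChildren` and the helper `describeChildren` are made up for illustration; they are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DomChildren {
    // Hypothetical helper: describes each direct child of the root element
    // as `kind: "content"`, where kind is "text" or the element name.
    static List<String> describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> out = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                String kind = child.getNodeType() == Node.TEXT_NODE
                        ? "text"
                        : child.getNodeName();
                out.add(kind + ": \"" + child.getTextContent() + "\"");
            }
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
        describeChildren(xml).forEach(System.out::println);
    }
}
```

For the sample paragraph this should report three children: a text node, the E element, and a second text node. The whitespace around the text is preserved exactly as it appears in the source.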
To get the result you want, iterate over the child nodes of <P> and extract all the text nodes.
One way to do this is with the jsoup library:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class Test {
    public static void main(String[] args) throws Exception {
        // Inner double quotes must be escaped inside a Java string literal.
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
        // Jsoup.parse uses the HTML parser, which lower-cases tag names, hence select("p").
        Document doc = Jsoup.parse(xml);
        for (Element e : doc.select("p")) {
            for (Node child : e.childNodes()) {
                if (child instanceof TextNode) {
                    // Text that sits directly inside <P>.
                    System.out.println(((TextNode) child).text());
                } else if (child instanceof Element) {
                    // Text of a child element such as <E>.
                    System.out.println(((Element) child).text());
                }
            }
        }
    }
}
Output:
(b)
Filing of financial reports.
(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
Use XPath. If you don't want to depend on a dedicated Java library, you can use just the standard Java API, for example:
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ExtractingAllTextNodes {
    // Inner double quotes must be escaped inside a Java string literal.
    private static final String XML = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";

    public static void main(final String[] args) throws Exception {
        final XPath xPath = XPathFactory.newInstance().newXPath();
        final DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        final DocumentBuilder builder = builderFactory.newDocumentBuilder();
        // "//text()" selects every text node in the document, in document order.
        final String expression = "//text()";
        final Document xmlDocument = builder.parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        final NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
        for (int i = 0; i < nodeList.getLength(); i++) {
            System.out.println("=> " + nodeList.item(i).getTextContent());
        }
    }
}
Output:
=> (b)
=> Filing of financial reports.
=> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
You can change the XPath expression to suit your needs.
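As an illustration of varying the expression, and of returning a string array as the question asks instead of printing, here is a rough sketch. The class `TextNodesToArray` and its `select` method are names invented for this example. Trimming the values and dropping empty ones is an assumption about what the asker wants, not something the question states.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TextNodesToArray {
    // Hypothetical helper: evaluates an XPath expression against the XML and
    // returns the trimmed, non-empty text content of each matched node.
    static String[] select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result.toArray(new String[0]);
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
        // Every text node in the document.
        for (String s : select(xml, "//text()")) {
            System.out.println(s);
        }
        // Only the text inside <E> elements.
        for (String s : select(xml, "//E/text()")) {
            System.out.println(s);
        }
    }
}
```

With the sample paragraph, `//text()` should give the three strings from the question, while `//E/text()` narrows the result to the heading text alone.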
OK. I finally managed to find a solution to the problem. The code is a bit convoluted, but it uses DOM, which is a standard API for XML parsing:
public static void parseSection(Element sec){
    // All <P> elements under the first <contents> element of the section.
    NodeList pTags = ((Element) sec.getElementsByTagName("contents").item(0))
            .getElementsByTagName("P");
    int pTagIndex = 0;
    while (pTagIndex < pTags.getLength()) {
        System.out.println(pTagIndex);
        Node pTag = pTags.item(pTagIndex);
        NodeList pTagChildren = pTag.getChildNodes();
        int pTagChildrenIndex = 0;
        while (pTagChildrenIndex < pTagChildren.getLength()) {
            Node pTagChild = pTagChildren.item(pTagChildrenIndex);
            if (pTagChild.getNodeName().equals("#text")) {
                // Text that sits directly inside <P>.
                System.out.println("Text: " + pTagChild.getNodeValue());
            } else if (pTagChild.getNodeName().equals("E")) {
                // Text wrapped in an <E> child element.
                System.out.println("E: " + pTagChild.getTextContent());
            }
            pTagChildrenIndex++;
        }