How to split a malformed XML file at each XML declaration and write the pieces to separate XML files so they can be parsed

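My problem: I want to parse a large number of large XML files and write the data to a MySQL database. The trouble is that none of these files is well-formed, because the issuing authority concatenates multiple XML documents into a single file before publishing it. My SAX parser works fine on a single XML document but throws an error on these files, because it cannot handle a file that contains more than one XML declaration (xml version......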

The error message thrown:

Exception in thread "main" org.xml.sax.SAXParseException; systemId: ....."[xX][mM][lL]" .....

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0535456-20070123.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20070110" date-publ="20070123">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0535456</doc-number>
<kind>S1</kind>
<date>20070123</date>
</document-id>
</publication-reference>
<us-application-series-code>29</us-application-series-code>
</us-bibliographic-data-grant>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0535457-20070123.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20070110" date-publ="20070123">
<us-bibliographic-data-grant>
...

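From what I have found on several forums and websites, the only sensible solution seems to be to read the XML file, split it at the root tag, and write each piece to a separate XML file. How can I read and write the XML file without parsing it with SAX/StAX/DOM?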

The result should be as follows. XML file 1:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0535456-20070123.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20070110" date-publ="20070123">
<us-bibliographic-data-grant>
...
</us-bibliographic-data-grant>
</us-patent-grant>

XML file 2:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0535457-20070123.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20070110" date-publ="20070123">
<us-bibliographic-data-grant>
...

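Because your file contains multiple XML documents, it is not really an XML file; it is just a file. You can therefore read it with any file-reading class you like (for example FileReader).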
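As a minimal sketch of that plain-file approach, the code below reads the combined file line by line and starts a new output file whenever a line begins with `<?xml`. The class name `XmlFileSplitter`, the `split` method, and the `part-N.xml` naming are hypothetical, and the sketch assumes UTF-8 input in which every XML declaration starts at the beginning of a line, as in the sample above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original answer.
class XmlFileSplitter {

    /** Splits input at every line starting with "<?xml" and returns the files written. */
    static List<Path> split(Path input, Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        Files.createDirectories(outputDir);
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("<?xml")) {
                    // A new declaration starts a new document: finish the previous part.
                    if (out != null) {
                        out.close();
                    }
                    Path part = outputDir.resolve("part-" + (parts.size() + 1) + ".xml");
                    out = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    parts.add(part);
                }
                if (out != null) { // ignore anything before the first declaration
                    out.write(line);
                    out.newLine();
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: combined input file, args[1]: output directory
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + parts.size() + " files");
    }
}
```

Each resulting file holds exactly one declaration, DOCTYPE, and root element, so it can be handed to the existing SAX parser unchanged. The cost is a second copy of the data on disk.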

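Another option is to extend a Reader or stream and create a new class that handles a file containing multiple XML documents. It would need to:

return end-of-file when it finds the start of a new XML document, which tells the parser that the current document is finished
allow reading to continue after that pseudo end-of-file, so that the next XML document can be read
handle close() so that the file is only closed once the whole file has been read; some kind of force-close option might also be needed

Something like...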

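import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/**
 * Presents a file of concatenated XML documents as a series of single documents:
 * read() reports end-of-stream each time a new XML declaration is reached.
 */
public class ConcatenatedXmlReader extends BufferedReader {
    // Unconsumed part of the current line; "" means "fetch the next line", null means real end of file.
    private String nextLine = "";
    // True once the XML declaration of the current document has been handed to the parser.
    private boolean seenXmlStart = false;

    public ConcatenatedXmlReader(Reader reader, int size) {
        super(reader, size);
    }

    public ConcatenatedXmlReader(Reader reader) {
        super(reader);
    }

    // Which method you need to override probably depends on which SAX parser you use.
    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        readNextLine();
        if (nextLine == null) {
            return -1; // real end of file
        }
        if (nextLine.startsWith("<?xml")) {
            if (seenXmlStart) {
                return -1; // pseudo end of file: the next document starts here
            }
            seenXmlStart = true;
        }
        int addToBuffer = Math.min(nextLine.length(), length);
        // Copy into the caller's buffer starting at the requested offset.
        nextLine.getChars(0, addToBuffer, buffer, offset);
        nextLine = nextLine.substring(addToBuffer);
        return addToBuffer;
    }

    public boolean hasXmlDocuments() throws IOException {
        readNextLine();
        // Skip blank lines between documents so trailing whitespace is not mistaken for a document.
        while (nextLine != null && nextLine.trim().isEmpty()) {
            nextLine = "";
            readNextLine();
        }
        seenXmlStart = false;
        return nextLine != null;
    }

    private void readNextLine() throws IOException {
        if (nextLine != null && nextLine.isEmpty()) {
            String line = readLine();
            // readLine() strips the line terminator; restore it so adjacent lines are not merged.
            nextLine = (line == null) ? null : line + "\n";
        }
    }

    @Override
    public void close() throws IOException {
        // Override so the file is not closed while there are still more XML documents.
        if (nextLine != null) {
            return;
        }
        super.close();
    }
}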

You would then call the SAX parser repeatedly, for as long as the file contains more XML documents.

For example:

        SAXParserFactory factory = SAXParserFactory.newInstance();
        MyHandler handler = new MyHandler();
        ConcatenatedXmlReader reader = new ConcatenatedXmlReader(new FileReader(inputFile));
        SAXParser saxParser = factory.newSAXParser();
        while (reader.hasXmlDocuments()) {
            saxParser.parse(new InputSource(reader), handler);
        }
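
The snippet above names a MyHandler without showing it. Purely as an illustration, a handler along the following lines could collect the `doc-number` values from the sample documents. Everything in its body is an assumption: the field names, the `getDocNumbers` accessor, and in particular the `resolveEntity` override, which hands the parser an empty DTD on the assumption that the DTD file named in the DOCTYPE is not available locally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler body; the original answer only names MyHandler.
class MyHandler extends DefaultHandler {
    private final List<String> docNumbers = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inDocNumber = false;

    List<String> getDocNumbers() {
        return docNumbers;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Assumption: the DTD is not on disk, so substitute an empty one instead of failing.
        return new InputSource(new StringReader(""));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("doc-number".equals(qName)) {
            inDocNumber = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate until the element ends.
        if (inDocNumber) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("doc-number".equals(qName)) {
            docNumbers.add(text.toString().trim());
            inDocNumber = false;
        }
    }
}
```

Because the same handler instance is reused for every document in the loop, the list keeps growing across documents. In a real import you would probably write each record to MySQL in endDocument() and clear the state there.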
