I'm using Java 6. I have an XML template that begins like this:
<?xml version="1.0" encoding="UTF-8"?>
However, when I parse and output it with the following code (using Apache Commons IO 2.4), I notice ...
Document doc = null;
InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");
try
{
    byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
    InputSource src = new InputSource(new StringReader(new String(data)));
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    doc = builder.parse(src);
}
finally
{
    in.close();
}
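the first line is output as
<?xml version="1.0" encoding="UTF-16"?>
What do I need to do when parsing/outputting the file so that the header encoding stays "UTF-8"?
Edit: Following the suggestion given, I changed my code to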
Document doc = null;
InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);
try
{
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    doc = builder.parse(in);
}
finally
{
    in.close();
}
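but even though the first line of my input template file is
<?xml version="1.0" encoding="UTF-8"?>
when I output the document as a String, it produces
<?xml version="1.0" encoding="UTF-16"?>
as the first line. Here is what I use to output the "doc" object as a String ...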
private String getDocumentString(Document doc)
{
    DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    return lsSerializer.writeToString(doc);
}
new StringReader(new String(data))
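This is wrong: new String(data) decodes the bytes with the platform default charset rather than the encoding declared in the document. You should let the parser detect the document encoding, for example by using DocumentBuilder.parse(InputStream):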
doc = builder.parse(in);
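To illustrate why the decoding step matters, here is a minimal sketch that parses the same UTF-8 bytes both ways. The class name EncodingDetection, its method names, and the sample document are made up for this example, and ISO-8859-1 is passed explicitly only to simulate a platform whose default charset is not UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EncodingDetection {
    // Hypothetical sample document; \u00e9 is a non-ASCII character (e with acute accent).
    public static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>caf\u00e9</foo>";

    // Hands the raw bytes to the parser, which reads the XML declaration itself.
    public static String textViaStream(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(data));
        return doc.getDocumentElement().getTextContent();
    }

    // Decodes the bytes up front with a charset that does not match the document.
    // The parser then receives characters and cannot undo the wrong decoding.
    public static String textViaWrongCharset(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String decoded = new String(data, "ISO-8859-1");
        Document doc = builder.parse(new InputSource(new StringReader(decoded)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = XML.getBytes("UTF-8");
        System.out.println("via InputStream:   " + textViaStream(data));
        System.out.println("via wrong charset: " + textViaWrongCharset(data));
    }
}
```

With a pure-ASCII template the two paths would probably look identical, which is why this kind of bug tends to stay hidden until non-ASCII content appears.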
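How the DOM is encoded depends on how you write it out. An in-memory DOM has no concept of encoding.
Writing the document to a string with a UTF-8 declaration: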
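As a small sketch of that point, the following parses a document declared as UTF-8 and serializes it with LSSerializer.writeToString. The class name DefaultStringEncoding and the sample input are assumptions for illustration. With the DOM implementation bundled in the JDK, the declaration in the resulting string should name UTF-16, since the serializer targets a Java String rather than the encoding of the original input.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class DefaultStringEncoding {
    // Parses the given bytes and serializes the DOM straight to a String.
    public static String writeToString(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        // Assumes the DOM implementation supports the Load and Save feature,
        // as the JDK's built-in one does.
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo/>";
        System.out.println(writeToString(input.getBytes("UTF-8")));
    }
}
```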
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.*;

public class DomIO {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(getDocumentString(doc));
    }

    public static String getDocumentString(Document doc) {
        DOMImplementationLS domImplementation = (DOMImplementationLS)
                doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        LSOutput lsOut = domImplementation.createLSOutput();
        lsOut.setEncoding("UTF-8");
        lsOut.setCharacterStream(new StringWriter());
        lsSerializer.write(doc, lsOut);
        return lsOut.getCharacterStream().toString();
    }
}
LSOutput also supports binary streams, if you want the serializer to encode the document correctly on output.
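A rough sketch of the binary-stream variant might look like the following. The class name DomBytes and the method toUtf8Bytes are invented for this example, and it assumes the JDK's built-in DOM implementation, which supports DOMImplementationLS. Here the serializer produces the bytes itself, so the declaration and the actual byte encoding agree.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class DomBytes {
    // Serializes the document to UTF-8 encoded bytes via LSOutput's byte stream.
    public static byte[] toUtf8Bytes(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        out.setEncoding("UTF-8");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.setByteStream(buffer);
        serializer.write(doc, out);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        // Decode only for display; the bytes could be written to a file or socket as-is.
        System.out.println(new String(toUtf8Bytes(doc), "UTF-8"));
    }
}
```

Returning bytes rather than a String keeps the encoding decision in one place, which may be preferable when the result goes to a file or over the network.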
When I changed my Document -> String method to
private String getDocumentString(Document doc)
{
    String ret = null;
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer;
    try
    {
        transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        ret = writer.toString();
    }
    catch (TransformerConfigurationException e)
    {
        e.printStackTrace();
    }
    catch (TransformerException e)
    {
        e.printStackTrace();
    }
    return ret;
}
my 'encoding="UTF-8"' header no longer gets output as 'encoding="UTF-16"'.
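If you would rather not rely on the Transformer's default, the encoding named in the declaration can be requested explicitly through OutputKeys.ENCODING. This is only a sketch: the class name TransformerEncoding and the method toXmlString are hypothetical, and because the result goes to a StringWriter the setting affects the declaration text, not any bytes.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class TransformerEncoding {
    // Serializes the document to a String, asking for a specific declared encoding.
    public static String toXmlString(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(toXmlString(doc, "UTF-8"));
    }
}
```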