I am parsing XML files from a web service, and occasionally I run into the following error:
xml2:::read_xml.raw(rs$content) # where the object rs is the response from the webservice, obtained using the httr package
Error in read_xml.raw(x, encoding = encoding, ...) :
xmlParseCharRef: invalid xmlChar value 2 [9]
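I have downloaded thousands of XML files, and only a few of them are broken.
My questions are:
How can I locate the character in the response that causes the error? What is a general strategy for repairing XML that is invalid because of an invalid xmlChar?
I have worked around the problem by parsing the response as HTML instead, but I would rather fix the issue and parse it as XML.
Thanks!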
I was able to figure it out as follows.
First, I looked at the content of the HTTP response:
xml_broken <- readBin(rs$content, what = "character")
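Once the response is available as text, the search for the offending reference can also be automated rather than done by hand. The following is a minimal sketch in base R, not part of the original answer: the helper name `find_invalid_char_refs` and the sample string are hypothetical, and the validity ranges are taken from the `Char` production of XML 1.0. It only looks at numeric character references such as `&#2;`, which is the kind of input the `xmlParseCharRef` error above complains about.

```r
# Hypothetical helper: list numeric character references that XML 1.0 forbids.
find_invalid_char_refs <- function(xml_text) {
  # Match decimal (&#2;) and hexadecimal (&#x2;) character references.
  m <- gregexpr("&#(x[0-9A-Fa-f]+|[0-9]+);", xml_text, perl = TRUE)
  refs <- regmatches(xml_text, m)[[1]]
  if (length(refs) == 0) {
    return(data.frame(position = integer(0), reference = character(0),
                      code_point = integer(0), stringsAsFactors = FALSE))
  }
  pos <- as.integer(m[[1]])

  # Strip the leading "&#" and trailing ";" and convert to a code point.
  body <- substr(refs, 3, nchar(refs) - 1)
  is_hex <- startsWith(body, "x")
  code <- ifelse(is_hex, strtoi(substring(body, 2), 16L), strtoi(body, 10L))

  # XML 1.0 Char: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
  valid <- code %in% c(9, 10, 13) |
    (code >= 32 & code <= 55295) |
    (code >= 57344 & code <= 65533) |
    (code >= 65536 & code <= 1114111)
  invalid <- is.na(valid) | !valid  # unparseable values count as invalid

  data.frame(position = pos[invalid], reference = refs[invalid],
             code_point = code[invalid], stringsAsFactors = FALSE)
}

# Usage on a made-up sample; with the real data one would pass xml_broken instead.
sample_xml <- "<a>K&#2;benhavn &#230; &#x1F; &amp;</a>"
print(find_invalid_char_refs(sample_xml))
```

The `position` column is the character offset of each reference, so `substr(xml_text, position - 40, position + 40)` gives enough surrounding context to guess what the character was meant to be.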
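I was then able to systematically remove data from the broken XML until I finally found the piece of text causing the problem:
"&#2;" # from the context I could see that this should be parsed as the Danish character 'æ'
From https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references I could see that this should actually be encoded as
"&aelig;"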
So in the end the HTTP content can be parsed with:
library(magrittr) # provides %>%
rs$content %>%
  readBin(what = "character") %>%
  gsub(pattern = "&#2;", replacement = "&aelig;") %>%
  XML::xmlParse()
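As for a more general strategy, one option is to repair the text before handing it to the parser: apply the substitutions you have verified by hand, then neutralise any remaining references to control characters. The sketch below is an assumption-laden illustration, not the method from the answer above. The helper name `repair_char_refs` and its arguments are hypothetical, the mapping from `&#2;` to 'æ' is only what the context suggested for this particular web service, and it writes the character as the numeric reference `&#230;` (U+00E6) because XML itself predefines only `&amp;`, `&lt;`, `&gt;`, `&apos;` and `&quot;`, so a named HTML entity may not be accepted by every XML parser. Only C0 control characters are handled; other invalid code points are left alone.

```r
# Hypothetical helper: rewrite numeric character references that XML 1.0 forbids.
repair_char_refs <- function(xml_text,
                             known = c("&#2;" = "&#230;"),  # assumption: &#2; stands for 'æ'
                             fallback = "") {
  # Apply hand-verified substitutions first (fixed = TRUE: literal matching).
  for (bad in names(known)) {
    xml_text <- gsub(bad, known[[bad]], xml_text, fixed = TRUE)
  }
  # Replace remaining references to control characters 0-31, except
  # tab (9), line feed (10) and carriage return (13), which XML 1.0 allows.
  ctrl <- "&#(0*([0-8]|1[124-9]|2[0-9]|3[01])|x0*([0-8BbCcEeFf]|1[0-9A-Fa-f]));"
  gsub(ctrl, fallback, xml_text, perl = TRUE)
}

# Usage on a made-up sample; with the real data one would pass
# readBin(rs$content, what = "character") instead.
sample_xml <- "<by>K&#2;ge &#x1F;&#10;&#9;</by>"
repaired <- repair_char_refs(sample_xml)

# Parse only if xml2 is installed, so the sketch runs without it too.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(repaired)
  print(xml2::xml_text(doc))
}
```

Dropping unknown control references silently loses information, so it may be worth logging what was removed, or setting `fallback` to a visible marker, before trusting the parsed result.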