Fixing invalid characters in an XML file: invalid xmlChar value 2 [9]

I am parsing XML files from a web service, and occasionally I run into the following error:

xml2:::read_xml.raw(rs$content) # where the object rs is the response from the webservice, obtained using the httr package
Error in read_xml.raw(x, encoding = encoding, ...) : 
xmlParseCharRef: invalid xmlChar value 2 [9]

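I have downloaded thousands of XML files, and only a few of them are broken.

My questions are:

How can I locate the character in the response that causes the error? And what is the general strategy for repairing XML that is invalid because of an invalid xmlChar?

I have worked around the problem by parsing the response as HTML instead, but I would rather fix the problem and parse it as XML.

Thanks!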

I was able to figure it out as follows.

First, look at the content of the HTTP response:

xml_broken <- readBin(rs$content, what = "character")

I was then able to systematically remove data from the broken XML until I finally found the piece of text that caused the problem:

"&#x2;" # from the context I could see that this should be parsed as the Danish character 'æ'

From https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references I could see that this should actually have been encoded as

"&aelig;" # the HTML entity for 'æ' (U+00E6)

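So in the end the HTTP content can be parsed as shown here. Note that &aelig; is an HTML entity that plain XML does not predefine, so the replacement uses the equivalent numeric character reference &#xE6; instead:

library(magrittr) # provides the %>% pipe

rs$content %>%
  readBin(what = "character") %>%                                    # raw bytes -> string
  gsub(pattern = "&#x2;", replacement = "&#xE6;", fixed = TRUE) %>%  # swap the bad reference for 'æ'
  XML::xmlParse()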
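
For the more general question of how to locate such characters without deleting data by hand, one possible approach is to scan the text for numeric character references and check each code point against the character ranges that XML 1.0 allows. The following is only a minimal sketch in base R: the helper names `is_valid_xml_char`, `find_invalid_char_refs` and `replace_invalid_char_refs` are hypothetical and not part of xml2 or XML, and it only catches invalid references of the form `&#N;` or `&#xH;`, not raw control bytes.

```r
# Hypothetical helper: TRUE if a code point is a legal XML 1.0 character
# (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).
is_valid_xml_char <- function(cp) {
  cp %in% c(0x9, 0xA, 0xD) |
    (cp >= 0x20 & cp <= 0xD7FF) |
    (cp >= 0xE000 & cp <= 0xFFFD) |
    (cp >= 0x10000 & cp <= 0x10FFFF)
}

# Hypothetical helper: report every numeric character reference in `txt`
# whose code point is not a legal XML 1.0 character.
find_invalid_char_refs <- function(txt) {
  m <- gregexpr("&#(x[0-9A-Fa-f]+|[0-9]+);", txt, perl = TRUE)
  refs <- regmatches(txt, m)[[1]]
  if (length(refs) == 0) {
    return(data.frame(position = integer(0), reference = character(0),
                      stringsAsFactors = FALSE))
  }
  pos <- as.integer(m[[1]])
  body <- sub("^&#(.*);$", "\\1", refs)   # e.g. "x2" or "230"
  is_hex <- startsWith(body, "x")
  cp <- ifelse(is_hex,
               strtoi(substring(body, 2), 16L),
               strtoi(body, 10L))
  # Values that cannot be parsed (NA, e.g. overflow) are treated as invalid.
  bad <- is.na(cp) | !is_valid_xml_char(cp)
  data.frame(position = pos[bad], reference = refs[bad],
             stringsAsFactors = FALSE)
}

# Hypothetical helper: replace every invalid reference with `replacement`
# (the default simply drops it).
replace_invalid_char_refs <- function(txt, replacement = "") {
  for (ref in unique(find_invalid_char_refs(txt)$reference)) {
    txt <- gsub(ref, replacement, txt, fixed = TRUE)
  }
  txt
}

# Made-up sample input, not taken from the real web service response.
xml_broken <- "<root><name>Skr&#x2;ppe &amp; &#xE6;ble</name></root>"
hits <- find_invalid_char_refs(xml_broken)
print(hits)
xml_fixed <- replace_invalid_char_refs(xml_broken, "&#xE6;")
print(xml_fixed)
```

The reported position can be used with `substr()` to inspect the surrounding text and decide what the character was meant to be. Mapping `&#x2;` to 'æ' was a deduction from the context of this particular data; in general the intended character cannot be inferred automatically, which is why the sketch removes invalid references by default.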
