How to correctly parse an XML URL with requests in Python



I want to parse an XML file from a URL.

I am doing the following:

import requests
req = requests.get('https://www.forbes.com/news_sitemap.xml')

Instead of the expected XML file, I get this:

<!doctype html>
<html lang="en">
<head>
<meta http-equiv="Content-Language" content="en_US">
<script type="text/javascript">
(function () {
function isValidUrl(toURL) {
// Regex taken from welcome ad.
return (toURL || '').match(/^(?:https?:?//)?(?:[^.(){}\/]*)?.?forbes.com(?:/|?|$)/i);
}
function getUrlParameter(name) {
name = name.replace(/[[]/, '\[').replace(/[]]/, '\]');
var regex = new RegExp('[\?&]' + name + '=([^&#]*)');
var results = regex.exec(location.search);
return results === null ? '' : decodeURIComponent(results[1].replace(/+/g, ' '));
};
function consentIsSet(message) {
console.log(message);
var result = JSON.parse(message.data);
if(result.message == "submit_preferences"){
var toURL = getUrlParameter("toURL");
if(!isValidUrl(toURL)){
toURL = "https://www.forbes.com/";
}
location.href=toURL;
}
}
var apiObject = {
PrivacyManagerAPI:
{
action: "getConsent",
timestamp: new Date().getTime(),
self: "forbes.com"
}
};
var json = JSON.stringify(apiObject);
window.top.postMessage(json,"*");
window.addEventListener("message", consentIsSet, false);
})();
</script>
</head>
<div id='teconsent'>
<script async="async" type="text/javascript" crossorigin src='//consent.truste.com/notice?domain=forbes.com&c=teconsent'></script>
</div>
<body>
</body>
</html>

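Is there also a better way to handle the XML file (for example, if it is compressed, or by parsing it recursively if the file is too large…)? Thanks.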

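This site checks for a GDPR consent cookie. If you send that cookie with the request, you get the XML file instead of the consent page. Try this code; it works fine for me.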

import requests
url = "https://www.forbes.com/news_sitemap.xml"
news_sitemap = requests.get(url, headers={"Cookie": "notice_gdpr_prefs=0,1,2:1a8b5228dd7ff0717196863a5d28ce6c"})
print(news_sitemap.text)
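
Since the server answers with an HTML consent page when the cookie is missing or not accepted, it may help to check the body before handing it to a parser. The following is only a minimal sketch: the helper name parse_xml_or_fail is made up for illustration, and the check assumes the consent page starts with an HTML doctype or an html tag, as in the output shown in the question.

```python
import xml.etree.ElementTree as ET


def parse_xml_or_fail(body: bytes) -> ET.Element:
    """Parse `body` as XML; raise ValueError if it looks like an HTML page.

    Hypothetical helper, not part of requests or ElementTree.
    """
    # Look only at the start of the document, ignoring leading whitespace.
    head = body.lstrip()[:200].lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        raise ValueError("got an HTML page (possibly a consent wall), not XML")
    # Passing bytes lets the parser honour the encoding in the XML declaration.
    return ET.fromstring(body)


# Possible usage with the response from the code above:
#     root = parse_xml_or_fail(news_sitemap.content)
```

Using news_sitemap.content (bytes) rather than news_sitemap.text avoids guessing the character encoding twice, once in requests and once in the XML parser.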

Use the requests module to fetch the XML file. Then you can use an XML parser library to do whatever you need with it.

import requests
url = "https://www.forbes.com/news_sitemap.xml"
x = requests.get(url)
print(x.text)
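
For the parsing step, and for the compressed or very large files mentioned in the question, one option is the standard library's xml.etree.ElementTree.iterparse, which reads the document incrementally instead of building the whole tree. The sketch below is an illustration under assumptions: the helper names open_xml_stream and iter_locs are made up, the gzip check relies on the two gzip magic bytes, and it assumes the sitemap stores its URLs in loc elements nested inside url or sitemap entries.

```python
import gzip
import io
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_xml_stream(data: bytes):
    """Return a binary file-like object, un-gzipping transparently if needed."""
    buf = io.BytesIO(data)
    if data[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=buf)
    return buf


def iter_locs(stream):
    """Yield the text of every <loc> element, whatever its XML namespace."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            yield elem.text.strip()
        elif local_name in ("url", "sitemap"):
            # The entry is complete: drop its children to keep memory low.
            elem.clear()


# Possible usage with the response from the code above:
#     for loc in iter_locs(open_xml_stream(x.content)):
#         print(loc)
```

If the file turns out to be a sitemap index, the yielded values are themselves sitemap URLs, so the same two helpers could be applied again to each of them. That would be one way to read the "recursive" parsing the question asks about.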
