I can't get the content of a website



I want to scrape the content of a website with the BeautifulSoup library, using the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.cora.fr"
request_page = urlopen(url)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')
print(html_soup.prettify())

I get this output:

<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script async="" src="/aginnied-Russiuerall-is-in-Now-I-and-haue-of-per">
</script>
</head>
<body style="margin:0px;height:100%">
<iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?SWUDNSAI=31&amp;xinfo=14-57686117-0%20NNNY%20RT%281647685973667%2052%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%2814%2c0%2c0%29%20U18&amp;incident_id=578000600070663752-301165151237836558&amp;edet=12&amp;cinfo=0e000000ee8c&amp;rpinfo=0&amp;cts=6o3aY0%2bK9yRMZVnfRogZQ5mdlFz%2f4pTp9kkaulxxrjzj29yFMZc4CDDz3DEQhaUm&amp;mth=GET" width="100%">
Request unsuccessful. Incapsula incident ID: 578000600070663752-301165151237836558
</iframe>
</body>
</html>

The body shows a different message instead of the actual page content. How can I fix this?

The HTML you got back says it explicitly: "Request unsuccessful. Incapsula incident". Incapsula lets a website block requests based on location.
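To tell this block page apart from real content before parsing it, a simple string check may be enough. The sketch below is only a heuristic: the function name looks_like_incapsula_block is hypothetical, and the two markers are taken from the output in the question, not from any documented Incapsula format.

```python
def looks_like_incapsula_block(html):
    """Heuristic: True if html resembles the block page shown in the question."""
    # Both markers appear in the response quoted above; other block pages
    # may look different, so treat a False result with some caution.
    markers = ("_Incapsula_Resource", "Incapsula incident ID")
    return any(marker in html for marker in markers)
```

If it returns True for the decoded response, the request was blocked and there is no point in passing the page to BeautifulSoup.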

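When I tried opening the site in Chrome from another country, it showed a CAPTCHA. Try sending browser-like headers, as in the following code, and then retry: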

from urllib.request import Request, urlopen  # Python 3 replacement for urllib2
req = Request("https://www.cora.fr", None, {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'})
response = urlopen(req).read()

Alternatively, you can use a proxy in your code.
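A minimal sketch of the proxy approach, using only the standard library. The proxy address and the helper names build_proxy_opener and fetch are hypothetical placeholders, and there is no guarantee that any particular proxy gets past the block.

```python
from urllib.request import ProxyHandler, Request, build_opener

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXY_URL = "http://proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)


def fetch(url, opener, user_agent="Mozilla/5.0"):
    """Fetch url through the given opener and return the raw response bytes."""
    req = Request(url, headers={"User-Agent": user_agent})
    with opener.open(req, timeout=30) as response:
        return response.read()


# Example usage (performs a real network request):
# opener = build_proxy_opener(PROXY_URL)
# page_html = fetch("https://www.cora.fr", opener)
```

The bytes returned by fetch can be passed to BeautifulSoup exactly like page_html in the question.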
