I want to scrape the content of a website with the BeautifulSoup library, using the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.cora.fr"
request_page = urlopen(url)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')
print(html_soup.prettify())
I get this output:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script async="" src="/aginnied-Russiuerall-is-in-Now-I-and-haue-of-per">
</script>
</head>
<body style="margin:0px;height:100%">
<iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=14-57686117-0%20NNNY%20RT%281647685973667%2052%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%2814%2c0%2c0%29%20U18&incident_id=578000600070663752-301165151237836558&edet=12&cinfo=0e000000ee8c&rpinfo=0&cts=6o3aY0%2bK9yRMZVnfRogZQ5mdlFz%2f4pTp9kkaulxxrjzj29yFMZc4CDDz3DEQhaUm&mth=GET" width="100%">
Request unsuccessful. Incapsula incident ID: 578000600070663752-301165151237836558
</iframe>
</body>
</html>
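The body shows different information rather than the actual content of the page. How can I fix this?
The HTML explicitly says "Request unsuccessful. Incapsula incident". Incapsula lets a website block requests based on location.
When I tried to open the site in Chrome from another country, it presented a CAPTCHA. Try sending a header like the one in the following code, then retry: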
from urllib.request import Request, urlopen  # urllib2 is Python 2 only; the question uses Python 3
req = Request("https://www.cora.fr", headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'})
response = urlopen(req).read()
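As a slightly fuller sketch of the same idea, the header can be wrapped in small helpers, together with a check for the block page shown in the question. The names `build_request`, `looks_blocked`, and `fetch_html` are made up for illustration, and the marker string is simply the text from the output above. A browser-like User-Agent may well not be enough to get past Incapsula, so treat this as something to try, not a guaranteed fix.

```python
from urllib.request import Request, urlopen

# User-Agent string reused from the answer above; any browser-like value could be tried.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) "
    "Gecko/20091102 Firefox/3.5.5"
)


def build_request(url, user_agent=BROWSER_UA):
    """Hypothetical helper: build a Request that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def looks_blocked(html):
    """Hypothetical helper: True if the page contains the Incapsula block message."""
    return "Incapsula incident" in html


def fetch_html(url, timeout=10):
    """Hypothetical helper: download the page and return it as decoded text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The text returned by `fetch_html("https://www.cora.fr")` can be passed to `BeautifulSoup(..., 'html.parser')` as in the question. If `looks_blocked` still returns True for it, the request is still being rejected.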
Alternatively, you can use a proxy in your code.
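A minimal sketch of the proxy route with the standard library, assuming you have access to a proxy in a location the site accepts. The address `http://proxy.example.com:8080` is a placeholder, not a real proxy, and `build_proxy_opener` is a hypothetical helper name.

```python
from urllib.request import ProxyHandler, build_opener

# Placeholder address (assumption): replace with a proxy you are allowed to use.
PROXY_URL = "http://proxy.example.com:8080"
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) "
    "Gecko/20091102 Firefox/3.5.5"
)


def build_proxy_opener(proxy_url=PROXY_URL, user_agent=USER_AGENT):
    """Hypothetical helper: return an opener that routes HTTP(S) through proxy_url."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = build_opener(handler)
    # Replace urllib's default "Python-urllib/x.y" User-Agent with a browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

With a working proxy, `build_proxy_opener().open("https://www.cora.fr").read()` returns the page bytes, which can then go to BeautifulSoup as before.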