Python BeautifulSoup picking up a web page: the same code works on and off



I use the same code to pick up text from a web page, but most of the time it shows "WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER". Surprisingly, it sometimes works: for example, I ran the code 12 times and it succeeded once.

Same code, same URL. Why does this happen?

from bs4 import BeautifulSoup
import re
import urllib2

url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
web_p = soup.find_all('span',class_='url')
for web in web_p:
    print web 

The traceback details are as follows:

Traceback (most recent call last):
  File "C:\Python27\lib\idlelib\run.py", line 112, in main
    seq, request = rpc.request_queue.get(block=True, timeout=0.05)
  File "C:\Python27\lib\Queue.py", line 176, in get
    raise Empty
Empty
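
Given that the same code and URL behave differently from run to run, one thing worth checking is the Content-Encoding header of each response. This is only a diagnostic sketch; the assumption that the server sometimes returns a gzip-compressed body is mine, although the fix below points the same way:

import urllib2

url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
response = urllib2.urlopen(url)
# If this prints 'gzip', the body handed to BeautifulSoup is compressed
# bytes, which would explain the "could not be decoded" warning.
print response.info().get('Content-Encoding')
print response.info().get('Content-Type')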

Thanks to isedev's guidance, and to the answer in "Does python urllib2 automatically uncompress gzip data fetched from webpage?", the code was changed to the following, which now works:

from StringIO import StringIO
import gzip
from bs4 import BeautifulSoup
import re
import urllib2

request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
# urllib2 does not decompress gzip bodies on its own, so unpack them manually.
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # Fall back to the raw body so `data` is defined for uncompressed responses.
    data = response.read()
soup = BeautifulSoup(data)
web_p = soup.find_all('span', class_='url')
for web in web_p:
    print web
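
For reference, the StringIO/gzip pair can also be replaced by the zlib module alone. This is just an alternative sketch, not part of the answer above; the 16 + MAX_WBITS value is the zlib flag for decoding data wrapped in a gzip header:

from bs4 import BeautifulSoup
import urllib2
import zlib

request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
data = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    # 16 + zlib.MAX_WBITS tells zlib to expect the gzip header and trailer.
    data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
soup = BeautifulSoup(data)
for web in soup.find_all('span', class_='url'):
    print web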


Thanks to Blender's guidance, the code can be simplified:

from bs4 import BeautifulSoup
import requests
html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text
soup = BeautifulSoup(html)
web_p = soup.find_all('span',class_='url')
for web in web_p:
    print web
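
requests decompresses gzip and deflate bodies before exposing .text, which is why no manual gzip handling is needed here. As a small variation (a sketch; the explicit 'html.parser' argument and the .get_text() call are additions of mine, not part of the original answer), the matched spans can be printed as plain text:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text
# Name the parser explicitly so the result does not depend on which
# optional parser happens to be installed.
soup = BeautifulSoup(html, 'html.parser')
for web in soup.find_all('span', class_='url'):
    print web.get_text()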
