Cleaning HTML data with BeautifulSoup



I'm taking a Python course on Coursera. For one assignment, I have to scrape an HTML web page and use it in my code.

Here is the code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('http://py4e-data.dr-chuck.net/comments_828036.html')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
print(sum)

I'm using OnlineGDB as my compiler. When I compile and run it, this error occurs:

Traceback (most recent call last):
  File "main.py", line 11, in <module>
    html = urllib.request.urlopen(url).read()
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 448, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.4/urllib/request.py", line 266, in __init__
    self.full_url = url
  File "/usr/lib/python3.4/urllib/request.py", line 292, in full_url
    self._parse()
  File "/usr/lib/python3.4/urllib/request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''

Can someone explain what the problem is and how to fix it?

The problem seems to be in this line:

url = input('http://py4e-data.dr-chuck.net/comments_828036.html')

Python's input() function lets the user type in something for the code to use. The argument passed to input() (in this case the URL) is only the prompt text displayed to the user; it is not what the function returns. For example, age = input('Enter your age -> ') would prompt the user like this:

Enter your age -> #you would enter it here, then the age variable would be assigned the input
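That prompt behavior also explains the traceback: if you just press Enter at the (confusingly worded) prompt, input() returns an empty string, and passing '' to urlopen() raises exactly the ValueError shown above. A minimal sketch reproducing it:

```python
import urllib.request

try:
    # An empty URL -- what you get when the user presses Enter at the prompt
    urllib.request.urlopen('')
except ValueError as e:
    print(e)  # unknown url type: ''
```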

In any case, you don't seem to need user input at all. So all you have to do to fix the code is remove the input() call and assign the URL directly to the url variable, like this:

url = 'http://py4e-data.dr-chuck.net/comments_828036.html'

Final code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
# Pass the context so the SSL settings above are actually used
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the span tags and sum their numeric contents
tags = soup('span')
total = 0
for tag in tags:
    total = total + int(tag.contents[0])
print(total)
Output: 2525
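The summing loop assumes each span tag wraps a bare number, so tag.contents[0] is a text node that int() can parse. A self-contained sketch with made-up inline HTML (the class name and counts are hypothetical, not taken from the assignment page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample resembling the assignment page's structure
html = "<span class='comments'>97</span><span class='comments'>90</span>"
soup = BeautifulSoup(html, 'html.parser')

total = 0
for tag in soup('span'):           # soup('span') is shorthand for soup.find_all('span')
    total += int(tag.contents[0])  # contents[0] is the text node inside the tag
print(total)
```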


Additionally, your code can be simplified a bit by using the requests module:

import requests
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

tags = soup('span')
total = 0
for tag in tags:
    total += int(tag.contents[0])
print(total)
