I'm taking a Python course on Coursera. One of the assignments requires me to fetch an HTML page and use it in my code.
Here is the code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('http://py4e-data.dr-chuck.net/comments_828036.html')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
print(sum)
I'm using OnlineGDB as my compiler, and when I compile and run the program I get this error:
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    html = urllib.request.urlopen(url).read()
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 448, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.4/urllib/request.py", line 266, in __init__
    self.full_url = url
  File "/usr/lib/python3.4/urllib/request.py", line 292, in full_url
    self._parse()
  File "/usr/lib/python3.4/urllib/request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''
Can someone explain what the problem is and how to fix it?
The problem seems to be on this line:
url = input('http://py4e-data.dr-chuck.net/comments_828036.html')
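input() in Python lets the user type in a value for the program to use. The argument passed to input() (in this case, the URL) is only the text displayed as the prompt; it is not the value that input() returns. Because nothing was typed at that prompt, input() returned an empty string, which is why urlopen() raised ValueError: unknown url type: ''. For example, age = input('Enter your age -> ')
would prompt the user like this: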
Enter your age -> #you would enter it here, then the age variable would be assigned the input
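To make the difference between the prompt and the returned value concrete, here is a small sketch that simulates a user at the keyboard. The helper name ask is made up for this illustration; it temporarily replaces sys.stdin so the example runs without anyone typing. The last part reproduces the error from the question, which is raised before any network connection is attempted.

```python
import io
import sys
import urllib.request

def ask(prompt, typed):
    """Call input(prompt) as if the user typed `typed` and pressed Enter.

    Hypothetical helper for this illustration only.
    """
    real_stdin = sys.stdin
    sys.stdin = io.StringIO(typed + '\n')  # stand-in for the keyboard
    try:
        answer = input(prompt)
    finally:
        sys.stdin = real_stdin
    print(typed)  # mimic the terminal echoing what was typed
    return answer

# The prompt is only displayed; the return value is what was typed.
age = ask('Enter your age -> ', '42')        # age == '42'

# Using the URL as the prompt and pressing Enter gives an empty string.
url = ask('http://py4e-data.dr-chuck.net/comments_828036.html', '')  # url == ''

# urlopen() rejects the empty string before it contacts any server.
try:
    urllib.request.urlopen(url)
except ValueError as err:
    message = str(err)
    print(message)
```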
In any case, you don't seem to need input at all. So to fix the code, just remove the input() call and assign the URL directly to the url variable, like this:
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
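If the assignment does expect the URL to be typed in at run time, one possible variation is to keep input() with a short prompt and fall back to the known URL when nothing is entered. This is only a sketch: choose_url and DEFAULT_URL are names invented here, not part of the original code.

```python
# Hypothetical default, taken from the URL in the question.
DEFAULT_URL = 'http://py4e-data.dr-chuck.net/comments_828036.html'

def choose_url(typed, default=DEFAULT_URL):
    """Return the typed URL, or the default if the user entered nothing."""
    typed = typed.strip()
    return typed if typed else default

# In the real script this would be used as:
#     url = choose_url(input('Enter URL -> '))
print(choose_url(''))                      # falls back to DEFAULT_URL
print(choose_url(' http://example.com '))  # surrounding spaces are removed
```

With this approach, pressing Enter at the prompt no longer produces an empty URL.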
Final code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the span tags
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
print(sum)
Output: 2525
Also, using the requests module can simplify your code a bit:
import requests
from bs4 import BeautifulSoup
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
sum = 0
for tag in tags:
    sum += int(tag.contents[0])
print(sum)
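The summing step can also be checked without any network access or third-party packages. The sketch below swaps BeautifulSoup for the standard library's html.parser purely for illustration; the class name SpanSummer, the function sum_spans, and the sample HTML are all made up here and are not the real page contents.

```python
from html.parser import HTMLParser

class SpanSummer(HTMLParser):
    """Add up the integers found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Only count text that sits inside a span and is a whole number.
        if self.in_span and data.strip().isdigit():
            self.total += int(data)

def sum_spans(html):
    parser = SpanSummer()
    parser.feed(html)
    parser.close()
    return parser.total

# Invented sample in the same spirit as the assignment page.
sample = ('<tr><td>Alice</td><td><span class="comments">97</span></td></tr>'
          '<tr><td>Bob</td><td><span class="comments">3</span></td></tr>')
print(sum_spans(sample))  # 100
```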