I am trying to scrape the contents of the <a> tags from a web page. My code is:
from bs4 import BeautifulSoup
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
url = 'https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
lessons = soup.find_all('li', class_='toc-level-1')
lesson = lessons[0]
print(lesson)
The page contains this element (copied directly from the output of the Firefox DOM inspector):
<li class="toc-level-1 t-toc-level-1 js-content-uri" data-content-uri="/api/v1/book/9780134985961/chapter/LPOC_00_00_00.html">
<a href="/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html" class="t-chapter" tabindex="39">Introduction</a>
<ol>
<li class="toc-level-2 t-toc-level-2 js-content-uri" data-content-uri="/api/v1/book/9780134985961/chapter/LPOC_00_00_00.html"><a href="/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html" class="t-chapter" tabindex="41">Linux Performance Optimization: Introduction</a></li>
</ol>
</li>
However, when I scrape the data with the requests and bs4 modules using the code above, the output I get is:
<li class="toc-level-1 t-toc-level-1">
<a class="t-chapter js-chapter" href="https://www.safaribooksonline.comhttps://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html">Introduction</a>
<ol>
<li class="toc-level-2 t-toc-level-2">
<a class="t-chapter js-chapter" href="https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html">Linux Performance Optimization: Introduction</a>
</li>
</ol>
</li>
Notice the href values of the <a> tags? They should be relative URLs, for example: /library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html
Instead I get absolute URLs, and sometimes badly malformed ones: https://www.safaribooksonline.comhttps://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html
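I don't understand how the domain name ends up prepended to the link URL, since the original HTML contains only the relative href value, unless requests or bs4 is doing this. All of my earlier scripts that use the same approach now produce similar errors. Has something changed on the module side, or am I doing something wrong?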
You can use a regular expression to extract the URL from the href:
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

hrefs = set()
for lesson in soup.find_all('li', class_='toc-level-1'):
    for a in lesson.find_all('a', href=True):
        # Split on every scheme marker; the capture group keeps the markers in the result.
        found_urls = re.split(r'(https?://)', a['href'])
        # Keep the last scheme plus what follows it, which drops a doubled domain prefix.
        # An href with no scheme yields a single piece and is kept unchanged.
        hrefs.add(''.join(found_urls[-2:]))

for href in sorted(hrefs):
    print(href)
This gives you a list of the hrefs found, starting with:
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_00_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_01_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_01_01.html
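If you would rather avoid the regular expression, the same cleanup can probably be done with string searching plus urljoin from the standard library. The following is only a minimal sketch: normalize_href and BASE are hypothetical names introduced here, and it assumes that a scheme marker appearing after position 0 is always the start of the real URL, which would not hold for an href that carries another URL in its query string.

```python
from urllib.parse import urljoin

# Assumed site root, used only to resolve relative hrefs.
BASE = 'https://www.safaribooksonline.com'


def normalize_href(href, base=BASE):
    """Return a single absolute URL for a relative, absolute, or doubled href."""
    # Find the last scheme marker; a doubled prefix contains two of them.
    start = max(href.rfind('http://'), href.rfind('https://'))
    if start > 0:
        # Drop everything before the last scheme marker.
        href = href[start:]
    # urljoin leaves an absolute URL as it is and resolves a relative one against base.
    return urljoin(base, href)


doubled = ('https://www.safaribooksonline.com'
           'https://www.safaribooksonline.com/library/view/'
           'linux-performance-optimization/9780134985961/LPOC_00_00_00.html')
relative = ('/library/view/linux-performance-optimization/'
            '9780134985961/LPOC_00_00_00.html')

print(normalize_href(doubled))
print(normalize_href(relative))
```

Both calls should print the same clean absolute URL, so the result does not depend on which form of the href the server happens to send.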