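A generic way to scrape link titles from any site in Python

Is there a "generic" way to scrape link titles from any website in Python? For example, if I use the following code: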

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
html = urlopen(site)
# Parse the page with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html.read(), 'lxml')
# Headlines on this page are <span class="titletext"> elements
titles = soup.find_all('span', attrs={'class': 'titletext'})
for title in titles:
    print(title.contents)

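I can extract almost all of the headlines from news.google.com. However, if I run the same code against www.yahoo.com, it doesn't work because the HTML is structured differently.

Is there a more generic way to do this, so that it works for most sites?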

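No. Every site is different, and the more generic you make the scraper, the more data it will pick up that isn't specifically a headline.

For example, the following will get every headline from Google, and probably from Yahoo as well.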

titles = soup.find_all('a') 
for title in titles:
    print(title.get_text())

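However, it will also give you every heading and other link on the page, which will clutter your results. (About 150 of the links on that Google page are not headlines.)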
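
One way to cut down that noise is to keep only the links whose text looks like a headline. The sketch below is not from the original answer: the function name `likely_headlines`, the `min_words` threshold, and the sample HTML are all assumptions made for illustration, and a word-count heuristic like this will still miss short headlines and let long navigation links through.

```python
from bs4 import BeautifulSoup


def likely_headlines(html, min_words=4):
    """Return the text of links that have at least `min_words` words.

    Hypothetical heuristic: navigation links ("Home", "Sign in") tend to be
    short, while headlines tend to be longer phrases.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.find_all("a", href=True):
        # Collapse runs of whitespace and newlines into single spaces
        text = " ".join(a.get_text().split())
        if len(text.split()) >= min_words:
            results.append(text)
    return results


# Made-up sample markup; a real page would come from urlopen(site).read()
sample = """
<a href="/">Home</a>
<a href="/login">Sign in</a>
<a href="/story/1">Scientists find water on distant exoplanet</a>
<a href="/story/2">Markets   rally after
 rate decision</a>
"""

print(likely_headlines(sample))
```

The threshold trades precision for recall, so it would need tuning per site, which is the same limitation the answer describes.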

No, which is why we need CSS selectors and XPath. But if you only have a small number of pages, there is a convenient way to do it:

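site = "https://news.google.com"
# Pick the find_all() arguments for the site being scraped
if 'google' in site:
    filters = {'name': 'span', 'class': 'titletext'}
elif 'yahoo' in site:
    filters = {'name': 'blala', 'class': 'blala'}  # placeholders for Yahoo's markup
else:
    raise ValueError('no filters defined for ' + site)
titles = soup.find_all(**filters)
for title in titles:
    print(title.contents)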
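
The same idea can be written with CSS selectors, keeping one selector per hostname in a lookup table. This is only a sketch under assumptions: `SELECTORS`, `titles_for`, the `news.example.com` entry, and the sample HTML are hypothetical, and `span.titletext` is simply the class used in the question, which may not match the live page.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical per-site CSS selectors; each one has to be found by
# inspecting that site's HTML.
SELECTORS = {
    "news.google.com": "span.titletext",
    "news.example.com": "h3.headline a",
}


def titles_for(site, html):
    """Return headline texts from `html` using the selector configured for `site`."""
    host = urlparse(site).hostname
    selector = SELECTORS.get(host)
    if selector is None:
        raise KeyError(f"no selector configured for {host}")
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector and returns the matching elements
    return [el.get_text(strip=True) for el in soup.select(selector)]


# Made-up sample markup standing in for a downloaded page
sample = (
    '<div><span class="titletext">First story</span>'
    '<span class="other">Ad</span>'
    '<span class="titletext"> Second story </span></div>'
)

print(titles_for("https://news.google.com", sample))
```

Matching on the exact hostname avoids the false positives that a substring test such as `'google' in site` can produce, and supporting a new site means adding one entry to the table.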
