Getting the meta description from an external website

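I need to extract the meta description of an external website. I have already searched, and a simple answer may well exist, but I could not apply it to my code.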

Currently, I can get the title like this:

import urllib.request
from bs4 import BeautifulSoup

external_sites_html = urllib.request.urlopen(url)
soup = BeautifulSoup(external_sites_html, "html.parser")
title = soup.title.string

However, the description is a bit trickier. It can come in any of these forms:

<meta name="og:description" content="blabla">
<meta property="og:description" content="blabla">
<meta name="description" content="blabla">

So what I want is to extract the first of these that appears in the HTML, and then add it to the database:

entry.description = extracted_desc
entry.save()

If no description is found at all, only the title should be saved.

You can use the find method on the soup object to look up tags with specific attributes. Here we need a meta tag whose name attribute equals og:description or description, or whose property attribute equals og:description.

# First get the meta description tag
description = soup.find('meta', attrs={'name':'og:description'}) or soup.find('meta', attrs={'property':'og:description'}) or soup.find('meta', attrs={'name':'description'})
# If a description meta tag was found, get its content attribute and save it to the db entry
if description:
    entry.description = description.get('content')
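
The question also asks to fall back to the title when no description exists. A minimal sketch of that, assuming the three tag forms listed in the question and the standard-library "html.parser" backend; the function name extract_description is hypothetical and not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the first matching meta description, else the page title, else ''."""
    soup = BeautifulSoup(html, "html.parser")
    # Try each tag form in turn; `or` moves on when find() returns None.
    tag = (
        soup.find("meta", attrs={"name": "og:description"})
        or soup.find("meta", attrs={"property": "og:description"})
        or soup.find("meta", attrs={"name": "description"})
    )
    if tag and tag.get("content"):
        return tag["content"]
    # No usable description: fall back to the <title> text, if there is one.
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return ""
```

The result could then be assigned to entry.description before calling entry.save(). Note that this picks tags in order of preference, not in the order they appear in the document.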

You could do something like this:

# Order these in order of preference
description_selectors = [
    {"name": "description"},
    {"name": "og:description"},
    {"property": "og:description"}
]
for selector in description_selectors:
    # Restrict the search to <meta> tags so other elements with a name attribute are ignored
    description_tag = soup.find('meta', attrs=selector)
    if description_tag and description_tag.get('content'):
        description = description_tag['content']
        break
else:
    # No selector matched: use an empty description
    description = ''

Just note that the else belongs to the for, not to the if: it runs only when the loop finishes without hitting break.
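
To see the for/else behaviour in isolation, here is a small sketch that does not need Beautiful Soup at all. The helper name first_nonempty is made up for illustration:

```python
def first_nonempty(candidates):
    """Return the first truthy value, using for/else to supply a default."""
    for value in candidates:
        if value:
            result = value
            break
    else:
        # Runs only when the loop ended without hitting `break`.
        result = ""
    return result


print(first_nonempty([None, "", "found"]))  # found
print(repr(first_nonempty([None, ""])))     # ''
```

If the else were attached to the if instead, the default would be assigned on every non-matching iteration rather than once after the loop is exhausted.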
