How to get an accurate title from a web page without including the site's own data



I found this link [and a few others] that talks a bit about using BeautifulSoup to read HTML. It mostly does what I want, which is to grab the title of a web page.

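import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        # name the parser explicitly so bs4 does not have to guess one
        contents = BeautifulSoup(html, "html.parser")
        # contents.title is None when the page has no <title> tag
        if contents.title is not None:
            return contents.title.string
    return None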

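The problem I run into is that the title sometimes comes back with site metadata appended at the end, such as " - some_data". A good example is a link to a BBC Sport article, which reports its title as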

Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport

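I could do something simple, like cutting off everything after the last " - ":

title = title.rsplit(' - ', 1)[0]

But that assumes the metadata always follows a " - ". I don't want to assume there will never be an article whose real title ends with " - part_of_title".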

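I found the Newspaper3k library, but it is definitely more than I need. All I need is to grab a title and make sure it is the same as what the user posted. The friend who pointed me to Newspaper3k also mentioned that it can be buggy and does not always find titles correctly, so I would lean toward using something else if possible.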

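My current thought is to keep using BeautifulSoup and just add fuzzywuzzy, which honestly would also help with minor misspellings or punctuation differences. But I would certainly prefer to start from a place where I am comparing against accurate titles.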
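
A rough sketch of that comparison idea, for illustration only. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, and instead of deciding up front which part of the page title is metadata, it compares the user's title against the full page title and against each version with trailing delimiter-separated segments removed. The names `candidate_titles` and `titles_match`, the delimiter set, and the 0.9 threshold are all assumptions, not anything from an existing library.

```python
import re
from difflib import SequenceMatcher

# Delimiters that often separate a headline from a site name (an assumption):
# « » en dash, em dash, | and -, each surrounded by whitespace.
_DELIMS = re.compile(r"\s[\u00ab\u00bb\u2013\u2014|-]\s")


def _normalize(text):
    """Case-fold and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def candidate_titles(page_title):
    """Yield the full title, then each version with trailing segments cut off."""
    yield page_title
    for match in reversed(list(_DELIMS.finditer(page_title))):
        yield page_title[:match.start()]


def titles_match(user_title, page_title, threshold=0.9):
    """Return True if the user's title is close to any candidate title."""
    user = _normalize(user_title)
    return any(
        SequenceMatcher(None, user, _normalize(candidate)).ratio() >= threshold
        for candidate in candidate_titles(page_title)
    )


page = "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"
print(titles_match("Jack Charlton: 1966 England World Cup winner dies aged 85", page))
```

Because every truncation is tried, a real title that itself contains " - " still matches as long as the user typed it in full, so no single split point has to be assumed.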

Here is how reddit handles extracting the title (the code is Python 2 and uses the older BeautifulSoup 3 API).

https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255

def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head
    title = None
    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")
    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string
        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)
        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]
    if not title:
        return
    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)
    return title.encode('utf-8').strip()
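
For anyone on Python 3, the same idea can be sketched roughly as follows. This is not reddit's code: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs, the helper class `_HeadTitleParser` is a hypothetical name, and it reads the first og:title or `<title>` found anywhere in the document rather than checking that they sit inside `<head>`. The preference for og:title and the "trim less than half" rule follow the function above.

```python
import re
from html.parser import HTMLParser

# Same delimiter set as the reddit function: « » en dash, em dash, | and -.
_SITE_DELIM = re.compile(r"\s[\u00ab\u00bb\u2013\u2014|-]\s")


class _HeadTitleParser(HTMLParser):
    """Collect the og:title meta content and the text of the first <title>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.og_title = None
        self.title_parts = []
        self._in_title = False
        self._title_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "meta" and self.og_title is None:
            # og:title normally uses property=, but some sites use name=
            key = attrs.get("property") or attrs.get("name")
            if key == "og:title":
                self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Prefer og:title; otherwise use <title> with a likely site name trimmed."""
    parser = _HeadTitleParser()
    parser.feed(html)
    parser.close()
    title = parser.og_title
    if not title:
        title = "".join(parser.title_parts)
        # Searching the reversed string finds the last delimiter first.
        to_trim = _SITE_DELIM.search(title[::-1])
        # Only trim if that removes less than half of the title.
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-to_trim.end()]
    if not title:
        return None
    # Collapse extraneous whitespace.
    return re.sub(r"\s+", " ", title).strip()
```

Calling `extract_title(requests.get(url).text)` would plug this into the question's `get_title` flow. When a page provides og:title, no trimming heuristic is applied at all, which avoids cutting into a headline that legitimately contains a dash.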
