How to get the last URL link element with BeautifulSoup



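How can I get the last HTML link from a given page with BeautifulSoup? I am trying to get a link that contains lenta.ru. However, if a page contains several lenta.ru links, my code prints every one of them. I only want the last lenta.ru link on each page, which is the pointer link for the translation.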

I get these results

http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Expected output

http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

My code

import re
import requests
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import urlopen
with open("./uynaa.txt") as inFile:
    uynaa_txt = inFile.readlines()
for tmp in uynaa_txt:
    html = urlopen(tmp).read()
    soup = BeautifulSoup(html, "lxml")
    for a in soup.select('div.entry a'):
        if "lenta.ru" in a.get('href', ''):
            print(a, tmp)

uynaa.txt

https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Solution

soup.select('div.entry a')[-1]

Explanation

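soup.select returns a list, so you can retrieve its last item with [-1]. If the page has only one matching link, the last item is also the first, which causes no problems. Note that [-1] picks the last element matched by the selector, whatever its href is, so the selector (or a filter on href) has to restrict the matches to lenta.ru links if other links follow them inside div.entry. On an empty result, [-1] raises IndexError.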

# full working code
from bs4 import BeautifulSoup
example_page = """
<body>
<a href="http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/"></a>
<a href="http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
<a href="http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
</body>
"""
soup = BeautifulSoup(example_page, "lxml")
print(soup.body.select("a")[-1])
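
To tie this back to the question, here is a minimal sketch that keeps only the links whose href contains lenta.ru and then takes the last one. The helper name last_link_containing, its scope parameter, and the sample page are made up for illustration; the sketch also uses the built-in "html.parser" instead of "lxml" only to avoid an extra dependency, and it assumes the links sit inside div.entry as in the question's selector.

```python
from bs4 import BeautifulSoup


def last_link_containing(page_html, needle, scope="div.entry"):
    """Return the href of the last <a> inside `scope` whose href contains
    `needle`, or None if no link matches. (Hypothetical helper.)"""
    soup = BeautifulSoup(page_html, "html.parser")
    # [href*="..."] is the CSS attribute selector for "contains substring".
    matches = soup.select(f'{scope} a[href*="{needle}"]')
    # Guard against an empty result, where [-1] would raise IndexError.
    return matches[-1]["href"] if matches else None


# Made-up sample page: two lenta.ru links followed by an unrelated link.
sample_page = """
<div class="entry">
  <a href="http://lenta.ru/news/2012/09/03/ipsos/">first</a>
  <a href="http://www.lenta.ru/articles/2012/09/05/threat/">last lenta.ru link</a>
  <a href="https://example.com/share">unrelated trailing link</a>
</div>
"""

print(last_link_containing(sample_page, "lenta.ru"))
```

In the question's loop, the same idea would mean calling such a helper once per page (after stripping the trailing newline from each line of uynaa.txt) and printing the single returned href instead of printing inside the inner for loop.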
