How to get link URLs from an HTML page using Beautiful Soup



I have an HTML page that contains many table cells like this one:

<td class="b-list__main">
<a data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2" class="b-list__main__title">【info】10/23 develop note-new character</a><span class="b-list__main__icon"><i title="有圖片" class="material-icons icon-photo"></i></span>
</td>

I am new to Python and BeautifulSoup, and I am trying to get all the URLs from elements with this class. I tried:

for lastpage in root.find_all("td", class_="b-list__main"):
    print(lastpage.p)

Output:

<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2">【info】10/23 develop note-new character</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=774&amp;tnum=1">【Q】alient team choice</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=772&amp;tnum=1">【Q】lock account question</p>

Ideally I want the largest number, for example 774. I am taking it one step at a time, though: first get the URLs, then extract the numbers.

C.php?bsn=31888&amp;snA=773&amp;tnum=2
C.php?bsn=31888&amp;snA=774&amp;tnum=1
C.php?bsn=31888&amp;snA=772&amp;tnum=1

I also tried:

for lastpage in root.find_all("td", class_="b-list__main"):
    link = lastpage.find('p', href=True)
    if link is None:
        continue
    print(lastpage.p['href'])

But I get TypeError: 'NoneType' object is not subscriptable

Thanks for your help.

My code:

import bs4
import re
import urllib.request as req

def getData(url):
    # Send a browser-like User-Agent with the request
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 "
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    for lastpage in root.find_all("td", class_="b-list__main"):
        print(lastpage.p)

url = "https://forum.gamer.com.tw/B.php?bsn=31888"
getData(url)

I have never seen a p tag with an href attribute, but if the HTML really looks like that, you can try the following:

from bs4 import BeautifulSoup
import re

html = """
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2">【info】10/23 develop note-new character</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=774&amp;tnum=1">【Q】alient team choice</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=772&amp;tnum=1">【Q】lock account question</p>
"""
# 'html5lib' is a separate package that must be installed; 'html.parser' also works here
root = BeautifulSoup(html, 'html5lib')
links_lst = []
for lastpage in root.find_all("p"):
    links_lst.append(lastpage['href'])

Output:

>>> links_lst
['C.php?bsn=31888&snA=773&tnum=2', 'C.php?bsn=31888&snA=774&tnum=1', 'C.php?bsn=31888&snA=772&tnum=1']
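As a side note on the TypeError from the question: `tag.p` returns None when a cell contains no `<p>`, and subscripting None fails. The following is a minimal sketch of one way to guard against that, by subscripting the tag that `find` actually returned. The sample HTML and the names `sample_html`, `cell`, and `safe_links` are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second cell has no <p>, so `cell.p` would be None there.
sample_html = """
<table><tr>
<td class="b-list__main"><p class="b-list__main__title" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2">first</p></td>
<td class="b-list__main"><span>no link here</span></td>
<td class="b-list__main"><p class="b-list__main__title" href="C.php?bsn=31888&amp;snA=774&amp;tnum=1">second</p></td>
</tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

safe_links = []
for cell in soup.find_all("td", class_="b-list__main"):
    tag = cell.find("p", href=True)  # None when the cell has no <p> with an href
    if tag is None:
        continue  # skip cells without a link instead of subscripting None
    safe_links.append(tag["href"])

print(safe_links)
```

Using the result of `find` directly avoids looking the tag up a second time through `cell.p`, which may point at a different element or at nothing.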

To find the largest number, you can use a bit of regex. Just add these lines to the code above:

# Match the digits that follow "snA=" in each link
pattern = re.compile(r'(?<=snA=)\d+')
num_lst = []
for link in links_lst:
    num_lst.append(int(pattern.findall(link)[0]))
print(f"Largest Number = {max(num_lst)} , Full link = {links_lst[num_lst.index(max(num_lst))]}")

Output:

Largest Number = 774 , Full link = C.php?bsn=31888&snA=774&tnum=1
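If you would rather not use a regex, the same number can be read from the query string with the standard library. This is only a sketch under the assumption that every link carries an `snA` parameter; the helper name `sna_number` and the `links` list are introduced here for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The links collected above, repeated here so the snippet runs on its own
links = [
    "C.php?bsn=31888&snA=773&tnum=2",
    "C.php?bsn=31888&snA=774&tnum=1",
    "C.php?bsn=31888&snA=772&tnum=1",
]

def sna_number(link):
    """Return the snA query parameter of a link as an int (hypothetical helper)."""
    query = urlsplit(link).query          # e.g. "bsn=31888&snA=773&tnum=2"
    return int(parse_qs(query)["snA"][0])  # parse_qs maps each key to a list of values

# max() with a key function picks the link whose snA value is largest
newest = max(links, key=sna_number)
print(sna_number(newest), newest)
```

Parsing the query string does not depend on how many digits the number has or where `snA` sits in the URL.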

Edit:

Here is the complete code:

import bs4
import re
from urllib import request as req

links_lst = []

def getData(url):
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 "
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    for lastpage in root.find_all("div", class_="b-list__tile"):
        try:
            links_lst.append(lastpage.p['href'])
        except (TypeError, KeyError):
            # No <p> in this div (TypeError) or the <p> has no href (KeyError)
            pass

    # Match the digits that follow "snA=" in each link
    pattern = re.compile(r'(?<=snA=)\d+')
    num_lst = []
    for link in links_lst:
        num_lst.append(int(pattern.findall(link)[0]))
    print(f"Largest Number = {max(num_lst)} , Full link = {links_lst[num_lst.index(max(num_lst))]}")

url = "https://forum.gamer.com.tw/B.php?bsn=31888"
getData(url)

Output:

Largest Number = 774 , Full link = C.php?bsn=31888&snA=774&tnum=1
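The link printed above is relative to the forum page. If you need an absolute URL, for example to request that thread next, one option is `urllib.parse.urljoin`. This is a small sketch; the variable names `page_url`, `relative_link`, and `absolute_link` are chosen here for illustration.

```python
from urllib.parse import urljoin

page_url = "https://forum.gamer.com.tw/B.php?bsn=31888"  # the page that was scraped
relative_link = "C.php?bsn=31888&snA=774&tnum=1"         # href value found on that page

# urljoin resolves the relative href against the URL of the page it came from
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)
```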
