Python - BeautifulSoup - How to return two or more different elements with different attributes?



HTML example

<html>
<div book="blue" return="abc">
<h4 class="link">www.example.com</h4>
<p class="author">RODRIGO</p>
</html>

Ex1:

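import urllib.request
from bs4 import BeautifulSoup as soup

# url holds the address of the page to scrape
response = urllib.request.urlopen(url)
page_soup = soup(response.read(), "html.parser")
res = page_soup.find_all(attrs={"class": ["author", "link"]})
for each in res:
    print(each)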

Result1:

www.example.com RODRIGO


Ex2:

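import urllib.request
from bs4 import BeautifulSoup as soup

# url holds the address of the page to scrape
response = urllib.request.urlopen(url)
page_soup = soup(response.read(), "html.parser")
res = page_soup.find_all(attrs={"book": ["blue"]})
for each in res:
    print(each["return"])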

Result2:

abc


!! PROBLEM !!

My question is: how do I return all 3 results in a single query?

Result3:

www.example.com RODRIGO abc

The HTML example seems to be broken. Assuming the div wraps the other tags and may not be the only book, you can select all the books:

for e in soup.find_all(attrs={"book": ["blue"]}):
    print(' '.join(e.stripped_strings), e.get('return'))

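from bs4 import BeautifulSoup

html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>
'''
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all(attrs={"book": ["blue"]}):
    # All text inside the div, followed by the div's own "return" attribute
    print(' '.join(e.stripped_strings), e.get('return'))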

www.rodrigo.com RODRIGO abc

A more structured example could be:

data = []
for e in soup.select('[book="blue"]'):
    data.append({
        'link': e.h4.text,
        'author': e.select_one('.author').text,
        'return': e.get('return')
    })
print(data)

Output:

[{'link': 'www.rodrigo.com', 'author': 'RODRIGO', 'return': 'abc'}]
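
Since the div "may not be the only book", the same pattern can be sketched for several books at once. The markup below is hypothetical (two properly closed divs, with a second book invented for illustration), and the `[book]` selector is assumed to be what you want when the attribute value does not matter:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two properly closed "book" divs (not from the question).
html = """
<html>
<div book="blue" return="abc">
  <h4 class="link">www.example.com</h4>
  <p class="author">RODRIGO</p>
</div>
<div book="red" return="xyz">
  <h4 class="link">www.example.org</h4>
  <p class="author">MARIA</p>
</div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# [book] matches every element that carries a book attribute, whatever its value.
for e in soup.select("[book]"):
    link = e.select_one(".link")
    author = e.select_one(".author")
    rows.append({
        "book": e.get("book"),
        # Guard against a book that lacks one of the child tags.
        "link": link.get_text(strip=True) if link else None,
        "author": author.get_text(strip=True) if author else None,
        "return": e.get("return"),
    })

print(rows)
```

Each dictionary keeps the child texts and the parent's attribute together, so one pass over the divs yields all three values per book.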

For the case where one attribute can take several values, a regular-expression approach is suggested:

from bs4 import BeautifulSoup
import re

html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""

soup = BeautifulSoup(html, 'lxml')
# Match any tag whose class matches "link" or "author"
by_clss = soup.find_all(class_=re.compile(r'link|author'))
print(by_clss)
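
One caveat with the regular-expression filter, sketched below: Beautiful Soup applies a compiled pattern with `search()`, so an unanchored `link|author` also accepts class names that merely contain those words. The `hyperlink` span is a hypothetical element added only to show the difference; anchoring the pattern is one possible way to restrict it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the span's class only *contains* the word "link".
html = """
<div>
  <h4 class="link">www.example.com</h4>
  <p class="author">RODRIGO</p>
  <span class="hyperlink">not wanted</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches anywhere inside the class name.
loose = soup.find_all(class_=re.compile(r"link|author"))
# Anchored pattern: the whole class name must be "link" or "author".
strict = soup.find_all(class_=re.compile(r"^(?:link|author)$"))

loose_classes = [t["class"] for t in loose]
strict_classes = [t["class"] for t in strict]
print(loose_classes)
print(strict_classes)
```

If exact names are all that is needed, passing a list such as `class_=["link", "author"]` avoids the regular expression altogether.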

For more flexibility, a custom query function can be passed to find or find_all:

from bs4 import BeautifulSoup

html = """<html>
<div book="blue" return="abc"></div> <!-- div need a closing tag in a html-doc-->
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""

soup = BeautifulSoup(html, 'lxml')

def query(tag):
    if tag.has_attr('class'):
        # tag['class'] is a list; assumed here to hold a single value
        return set(tag['class']) <= {'link', 'author'}
    if tag.has_attr('book'):
        return tag['book'] in {'blue'}
    return False

print(soup.find_all(query))
# [<div book="blue" return="abc"></div>, <h4 class="link">www.rodrigo.com</h4>, <p class="author">RODRIGO</p>]

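Note that your HTML sample has no closing div tag. In the second case I added it, otherwise the soup... doesn't taste good.

EDIT: to retrieve elements that satisfy conditions on several attributes, the query could look like this: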

def query_by_attrs(**tag_kwargs):
    # tag_kwargs: {attr: [val1, val2], ...}
    def wrapper(tag):
        for attr, values in tag_kwargs.items():
            if tag.has_attr(attr):
                # check if tag has multi-valued attributes (class,...)
                if not isinstance((tag_attr := tag[attr]), list):  # := for python >=3.8
                    tag_attr = (tag_attr,)  # as tuple
                return bool(set(tag_attr).intersection(values))  # false if empty set
        return False  # the tag has none of the requested attributes
    return wrapper

q_data = {'class': ['link', 'author'], 'book': ['blue']}
results = soup.find_all(query_by_attrs(**q_data))
print(results)
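
Note that the function above decides on the first listed attribute a tag happens to have. If "simultaneous" is meant strictly, so that a tag must satisfy every listed attribute, a variant could look like the sketch below. The name `match_all_attrs` and the sample markup (including the `featured` class) are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first div has both book="blue" and class="featured".
html = """
<html>
<div book="blue" class="featured" return="abc"></div>
<div book="blue" return="def"></div>
<div book="red" class="featured" return="ghi"></div>
</html>
"""

def match_all_attrs(**wanted):
    """Build a filter that accepts a tag only if every listed attribute matches."""
    def matcher(tag):
        for attr, values in wanted.items():
            if not tag.has_attr(attr):
                return False
            current = tag[attr]
            # Multi-valued attributes such as class arrive as lists; wrap the rest.
            if not isinstance(current, list):
                current = [current]
            # At least one of the tag's values must be among the accepted ones.
            if not set(current) & set(values):
                return False
        return True
    return matcher

soup = BeautifulSoup(html, "html.parser")
conditions = {"class": ["featured"], "book": ["blue"]}
matches = soup.find_all(match_all_attrs(**conditions))
print([tag["return"] for tag in matches])  # ['abc']
```

The dictionary is unpacked with `**` because `class` is a reserved word and cannot be written as a literal keyword argument.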

Extract all links from a website

import requests
from bs4 import BeautifulSoup

url = 'https://mixkit.co/free-stock-music/hip-hop/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
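
The loop above prints None for anchors without an href and leaves relative links unresolved. A minimal sketch of one way to handle both follows; it uses an in-memory page instead of a network request, and the base URL and markup are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used instead of a live request.
base_url = "https://www.example.com/music/"
html = """
<a href="/free/track-1">Track 1</a>
<a href="track-2">Track 2</a>
<a name="top">no href</a>
<a href="https://www.example.org/other">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually have an href attribute;
# urljoin resolves relative paths against the page address.
urls = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(urls)
```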
