How do I add an HTML "path" (tags) from BeautifulSoup as a class instance variable in Python?

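I am trying to use BeautifulSoup to work with HTML data fetched from websites. I have created a "Websites" class with several functions that parse the HTML according to instance variables (e.g. header, class, etc.). For example: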

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

class Websites:
    def __init__(self, url, header, class_):
        self.url = url
        self.header = header
        self.class_ = class_

    def html(self):
        # Download the page and parse it into a BeautifulSoup object
        url = self.url
        webpage = urlopen(url)
        page_html = webpage.read()
        webpage.close()
        page_soup = bs(page_html, 'html.parser')
        return page_soup

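Turning these variables (header, class) into instance variables of the class was easy, but I am struggling to do the same with one more variable. I believe the BeautifulSoup term for it is a "tag". If I call the html function shown above on an instance of the class, I get a block of HTML that I can save as a variable (page_soup), and to which I can append tags, for example like this: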

page_soup.div.h1.p

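This specifies the exact part of the HTML that I want to access. Is there a way to modify the class's __init__ function shown above so that it takes an input such as: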

amazon = Websites(url = 'Amazon.co.uk', tag = '.div.h1.p')

and use it as an instance variable in the class methods, as self.tag?

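Accessing tags this way is the same as calling BeautifulSoup's find() function, which returns the first matching tag. So you can write your own function to emulate this approach, as follows: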

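from bs4 import BeautifulSoup

def get_tag(tag, text_attr):
    # Follow the dotted path one name at a time; find() returns the first match
    for attr in text_attr.split('.'):
        if attr:  # skip the empty string produced by a leading '.'
            tag = tag.find(attr)
            if tag is None:  # the path does not exist in this document
                return None
    return tag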

html = """<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.div.h1.p)
print(get_tag(soup, '.div.h1.p'))

This displays:

<p>display this</p>
<p>display this</p>
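
To connect this back to the class in the question, here is a minimal sketch, assuming the dotted path is stored as self.tag and resolved with the same get_tag helper. The tagged method and its optional page_soup argument are hypothetical names, added so the example can run on an inline HTML string instead of a live page, and https://example.com is only a placeholder URL.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def get_tag(tag, text_attr):
    # Walk the dotted path, taking the first match at each step.
    for attr in text_attr.split('.'):
        if attr:
            tag = tag.find(attr)
            if tag is None:
                return None
    return tag


class Websites:
    def __init__(self, url, tag):
        self.url = url
        self.tag = tag  # dotted path such as '.div.h1.p'

    def html(self):
        # Download and parse the page (needs network access).
        with urlopen(self.url) as webpage:
            return BeautifulSoup(webpage.read(), 'html.parser')

    def tagged(self, page_soup=None):
        # Hypothetical helper: resolve self.tag against a parsed page.
        if page_soup is None:
            page_soup = self.html()
        return get_tag(page_soup, self.tag)


# Offline demonstration using an inline document instead of a real download.
sample = BeautifulSoup(
    '<div><h1>test2<p>display this</p></h1></div>', 'html.parser'
)
site = Websites(url='https://example.com', tag='.div.h1.p')
print(site.tagged(sample))
```

Calling site.tagged() with no argument would download self.url first, so the URL then needs a scheme such as https://.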

Another approach is to use the .select() function, which takes a CSS selector and returns a list of matching tags:

print(soup.select('div > h1 > p')[0])
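
If only the first match is needed, select_one() may be a closer fit for an instance variable, since indexing the list from select() raises IndexError when nothing matches. The sketch below assumes the CSS selector string is what gets stored (for example as self.tag); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<html><h2>test1</h2><div><h1>test2<p>display this</p></h1></div></html>"
soup = BeautifulSoup(html, "html.parser")

selector = "div > h1 > p"  # a string like this could be stored as self.tag
first = soup.select_one(selector)  # first matching tag, or None
print(first)

missing = soup.select_one("div > h3")  # no such element in this document
print(missing)
```

Storing a CSS selector instead of a dotted path also allows matching on classes and ids, at the cost of a different syntax from page_soup.div.h1.p.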
