How to extract text including the "alt" attribute with Beautiful Soup

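I recently discovered Beautiful Soup, which seems very powerful. Is there a simple way to include the "alt" attribute of images in the extracted text? A simple example: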

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

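This results in

Among the different sections of the orchestra you will find:

A in the strings

A in the brass

A in the woodwinds

but I would like the alt attribute to be included in the extracted text, which would give

Among the different sections of the orchestra you will find:

A violin in the strings

A trumpet in the brass

A clarinet and saxophone in the woodwinds

Thanks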

Consider this approach.

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p')   # get all tags of type <p>
for tag in ptag:
    instrument = tag.find('img')    # search for <img>
    if instrument:  # if we found an <img> tag...
        # ...create a new string with the content of 'alt' inserted into 'tag.text'
        # (the index 2 assumes the text before the image is exactly "A ")
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp) # print
    else:   # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)

The output is

Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds

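The strategy is:

Find all <p> tags
Search each of these <p> tags for an <img> tag
If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
If we don't find an <img> tag, just print tag.text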
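
The strategy above can also be sketched without the hard-coded slice position. The following is a minimal sketch, not the answer's own method: it replaces each <img> tag in the parsed tree with its alt text and then calls the standard get_text(). The helper name text_with_alt is hypothetical, and the approach modifies the parsed tree, so it assumes you do not need the <img> tags afterwards.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""

def text_with_alt(markup):
    """Hypothetical helper: return the document text with each <img> replaced by its alt text."""
    soup = BeautifulSoup(markup, 'html.parser')
    for img in soup.find_all('img'):
        # Swap the tag for a plain string; fall back to '' when alt is missing.
        img.replace_with(img.get('alt', ''))
    return soup.get_text()

result = text_with_alt(html_doc)
print(result)
```

Because the alt text lands exactly where the image was, this works wherever the image sits inside the paragraph.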

a = soup.findAll('img')
for every in a:
    print(every['alt'])

This will do the job.

The first line finds all <img> tags (using findAll, the older name for find_all).

Or, for the text:

# 'a' is a list-like ResultSet, so read .text from each element, not from 'a' itself
for eachline in a:
    print(eachline.text)

Either loop over each result, or index them manually: soup.findAll('img')[0], then soup.findAll('img')[1], and so on.
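
Note that every['alt'] raises a KeyError when an image has no alt attribute. As a small sketch of a more tolerant variant, the snippet below uses Tag.get with a default value. The third image (spacer.gif) is a made-up example added only to show the missing-alt case.

```python
from bs4 import BeautifulSoup

html_doc = """
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="spacer.gif" /> hypothetical image without alt text</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Collect the alt text of every image; use '' when the attribute is absent.
alts = [img.get('alt', '') for img in soup.find_all('img')]
print(alts)
```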

If you want a general solution, you can use the function get_all_text() defined below as an alternative to the standard get_text():

from bs4.element import Tag, NavigableString
def get_all_text(element, separator=u"", strip=False):
    """
    Get all child strings, including images alt text, concatenated using the given separator.
    """
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

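With this solution you can also set a custom separator and choose whether to strip the strings, as with the standard get_text(). It also works when the image appears elsewhere in the text.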
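
To illustrate the separator and strip parameters, here is a small sketch on a single paragraph. The function body repeats the definition above only so that the snippet runs on its own; the "|" separator is an arbitrary choice for the demonstration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString

def get_all_text(element, separator="", strip=False):
    """Same logic as the function above, repeated so this snippet is self-contained."""
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

soup = BeautifulSoup(
    '<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>',
    'html.parser',
)

plain = get_all_text(soup)                              # pieces joined as-is
piped = get_all_text(soup, separator="|", strip=True)   # stripped pieces joined with "|"
print(plain)
print(piped)
```

With the defaults, the pieces keep their original whitespace, so the sentence reads naturally. With strip=True each piece is trimmed first, which makes the boundaries between text and alt strings visible through the separator.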

For your example, that would be:

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(get_all_text(soup))

Output:


Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
