How to extract "alt" along with the text using Beautiful Soup



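I just discovered Beautiful Soup, and it looks very powerful. I would like to know whether there is a simple way to extract the "alt" field together with the text. A simple example is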

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

This results in

Among the different sections of the orchestra you will find:

A  in the strings

A  in the brass

A  in the woodwinds

but I would like the alt field to be included in the extracted text, which would give

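Among the different sections of the orchestra you will find:

A violin in the strings

A trumpet in the brass

A clarinet and saxophone in the woodwinds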

Thanks

Please consider this approach.

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p')   # get all tags of type <p>
for tag in ptag:
    instrument = tag.find('img')    # search for <img>
    if instrument:  # if we found an <img> tag...
        # ...create a new string with the content of 'alt' in the middle of 'tag.text'
        # (the offset 2 assumes the text before the image is exactly "A ")
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp) # print
    else:   # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)

The output is

Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds

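The strategy is:

  1. Find all <p> tags
  2. Search for an <img> tag inside each of these <p> tags
  3. If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
  4. If we do not find an <img> tag, just print tag.text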
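
The slice tag.text[:2] above works only because every paragraph happens to start with "A ". As a possible alternative (a sketch, not part of the original answer), each <img> can be swapped for its alt text with replace_with before the text is read. The shortened html_doc here is just sample input, and note that this changes the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Shortened sample input, assumed for illustration
html_doc = """
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Swap every <img> for its alt text (empty string if the attribute is missing).
# This modifies the parsed tree, so the <img> tags are gone afterwards.
for img in soup.find_all('img'):
    img.replace_with(img.get('alt', ''))

# Read the text of each paragraph now that the images are plain strings
lines = [p.get_text() for p in soup.find_all('p')]
print('\n'.join(lines))
```

Because the alt text is inserted exactly where the image was, this does not depend on how many characters come before the image.
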

Alternatively, to print only the alt attributes:

a = soup.findAll('img')
for every in a:
    print(every['alt'])

This will do the job.

Line 1 finds all <img> tags (we use .findAll(), the older spelling of .find_all()).

Or for the text:

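# 'a' is a list-like ResultSet, so .text has to be read from each element
# (an <img> tag has no text content of its own, so this prints empty lines for images)
for eachline in a:
    print(eachline.text)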

A simple loop goes through each result, or you can index manually: soup.findAll('img')[0], then soup.findAll('img')[1], and so on.
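
One caveat with every['alt']: it raises a KeyError when an image has no alt attribute. A minimal sketch of two ways around that, using a made-up two-image snippet as input: .get() with a default, or asking find_all for only the tags that carry the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second image has no alt attribute
html_doc = '<p><img src="a.jpg" alt="violin" /> <img src="b.jpg" /></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# .get() returns the default instead of raising KeyError
alts = [img.get('alt', '') for img in soup.find_all('img')]

# alt=True keeps only the <img> tags that actually have an alt attribute
with_alt = [img['alt'] for img in soup.find_all('img', alt=True)]

print(alts)
print(with_alt)
```
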

If you want a general solution, you can use the function get_all_text() defined below as an alternative to the standard get_text():

from bs4.element import Tag, NavigableString
def get_all_text(element, separator=u"", strip=False):
    """
    Get all child strings, including images alt text, concatenated using the given separator.
    """
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

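With this solution you can also define a custom separator and choose whether to strip the strings, just like the standard get_text(). It also works in other situations, for example when the images are nested deeper inside other tags.

For your example, it looks like this: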
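
To illustrate the separator and strip options, here is a small sketch on a different, made-up snippet with an image nested inside a link and another image without alt. The function body is repeated from above only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def get_all_text(element, separator="", strip=False):
    """Same logic as the function above, repeated so this snippet is self-contained."""
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

# Hypothetical input: one image inside a link, one image without alt
html_doc = ('<div><a href="#"><img src="logo.png" alt="logo" /></a>'
            ' Home <img src="spacer.png" /></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Strip each piece and join the pieces with a single space
result = get_all_text(soup, separator=' ', strip=True)
print(result)
```

The image without alt contributes an empty string, which is dropped, so only the pieces that carry text end up in the result.
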

from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(get_all_text(soup))

Output:


Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
