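I've just discovered Beautiful Soup, and it looks very powerful. I'm wondering whether there is an easy way to extract the "alt" field along with the text. A simple example is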
from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
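This results in
Among the different sections of the orchestra you will find:
A  in the strings
A  in the brass
A  in the woodwinds
but I would like the alt field to be included in the extracted text, which would give
Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
Thanks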
Please consider this approach.
from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p') # get all tags of type <p>
for tag in ptag:
    instrument = tag.find('img') # search for <img>
    if instrument: # if we found an <img> tag...
        # ...create a new string with the content of 'alt' inserted into 'tag.text'
        # (index 2 works here because every such line starts with "A ")
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp) # print
    else: # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)
The output is
Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
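The strategy is:
- Find all <p> tags
- Search for an <img> tag inside each of these <p> tags
- If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
- If we don't find an <img> tag, just print tag.text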
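The strategy above relies on the alt text belonging at index 2 of tag.text. As a minimal sketch of a variation that does not depend on that position, each <img> can be replaced by its alt text before reading the paragraph text. This assumes it is acceptable to modify the parsed tree in place; the variable name lines is only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Swap every <img> for its alt text (empty string if there is no alt).
# Note: this changes the parsed tree itself, not a copy of it.
for img in soup.find_all("img"):
    img.replace_with(img.get("alt", ""))

# After the swap, the alt text is ordinary text inside each <p>.
lines = [p.get_text() for p in soup.find_all("p")]
print("\n".join(lines))
```

Because the alt text is inserted where the image was, it lands in the right place regardless of how the sentence begins.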
a = soup.findAll('img')
for every in a:
    print(every['alt'])
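This will do the job.
The first line finds all the <img> tags (using .findAll()).
Or, for the text:
# a is a ResultSet (a list of tags), so .text has to be read from each element
for eachline in a:
    print(eachline.text)
A simple loop goes through each result; alternatively, index the results manually with soup.findAll('img')[0], then soup.findAll('img')[1], and so on.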
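One caveat with every['alt'] is that it raises a KeyError when an image has no alt attribute. The following is a small sketch of two ways around that, using a shortened version of the example HTML plus a hypothetical third image (no-alt.jpg) that is not in the original question and is there only to show the difference.

```python
from bs4 import BeautifulSoup

html_doc = """
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>An image without alt text: <img src="no-alt.jpg" /></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# alt=True keeps only the <img> tags that actually have an alt attribute.
alts = [img["alt"] for img in soup.find_all("img", alt=True)]
print(alts)

# .get() returns a default value instead of raising KeyError.
all_alts = [img.get("alt", "") for img in soup.find_all("img")]
print(all_alts)
```

The first form skips images without alt text entirely; the second keeps one entry per image, which is useful when positions matter.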
If you want a general solution, you can use the function get_all_text() defined below as an alternative to the standard get_text():
from bs4.element import Tag, NavigableString
def get_all_text(element, separator="", strip=False):
    """
    Get all child strings, including images alt text, concatenated using the given separator.
    """
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            # plain text node
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            # image: use its alt text, or an empty string if it has none
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)
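With this solution you can also define a custom separator and choose whether to strip the strings, just as with the standard get_text(). It will also work in other situations.
For your example, it looks like this: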
from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(get_all_text(soup))
Output:
Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
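For reference, the separator and strip options mentioned above mirror those of the standard get_text(). Here is a minimal sketch of how they behave on a simplified snippet; the <b> tags are an illustrative stand-in (not from the original question) so that each paragraph contains several text nodes.

```python
from bs4 import BeautifulSoup

html_doc = "<p>A <b>violin</b> in the strings</p><p>A <b>trumpet</b> in the brass</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default: text nodes are concatenated with no separator.
default_text = soup.get_text()

# separator is placed between every pair of text nodes.
joined = soup.get_text(separator="|")

# strip=True trims whitespace from each text node before joining.
stripped = soup.get_text(separator=" ", strip=True)

print(repr(default_text))
print(repr(joined))
print(repr(stripped))
```

The same two arguments can be passed to get_all_text(), which additionally treats each image's alt text as one of the strings to join.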