Excluding an unwanted tag with Beautiful Soup in Python


<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>

How can I print "I Like your face" instead of "I Like to punch your face"?

I tried this:

lala = soup.find_all('span')
for p in lala:
    if not p.find(class_='unwanted'):
        print(p.text)

But it gave "TypeError: find() takes no keyword arguments".

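You can use extract() to remove the unwanted tag before you get the text.

But it keeps all the '\n' characters and spaces, so you will need some extra work to remove them.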

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 <span>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser')
external_span = soup.find('span')
print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())
unwanted = external_span.find('span')
unwanted.extract()
print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())

Result

1 HTML: <span>
  I Like
  <span class="unwanted"> to punch </span>
   your face
 <span></span></span>
1 TEXT: I Like
   to punch 
   your face
2 HTML: <span>
  I Like
   your face
 <span></span></span>
2 TEXT: I Like
   your face
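
The leftover '\n' characters and spaces can be collapsed with plain string methods. The snippet below is only a sketch: it assumes the outer span is properly closed with </span>, and it selects the tags to drop by the class name 'unwanted' from the question. The variable name cleaned is made up for this example.

```python
from bs4 import BeautifulSoup

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BeautifulSoup(data, 'html.parser')
external_span = soup.find('span')

# Remove every descendant tag that carries the "unwanted" class.
for tag in external_span.find_all(class_='unwanted'):
    tag.extract()

# str.split() with no arguments splits on any run of whitespace,
# including '\n', so joining with one space collapses the gaps.
cleaned = " ".join(external_span.get_text().split())
print(cleaned)  # I Like your face
```

If the removed tags are never needed again, decompose() can be used in place of extract(); extract() returns the removed tag, while decompose() destroys it.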

You can skip every Tag object inside the outer span and keep only the NavigableString objects (these are the plain text in the HTML).

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 <span>'''
from bs4 import BeautifulSoup as BS
import bs4
soup = BS(data, 'html.parser')
external_span = soup.find('span')
text = []
for x in external_span:
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))

Result

I Like your face
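
The same idea can probably be written more compactly by asking find_all() for strings only and turning recursion off, so that text inside nested tags is never visited. This is a sketch, not a drop-in replacement: direct_text is a hypothetical helper name, the HTML is assumed to be well-formed, and the string argument needs Beautiful Soup 4.4.0 or newer (older versions call it text).

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join only the strings that are direct children of `tag`."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside nested tags is skipped.
    pieces = tag.find_all(string=True, recursive=False)
    # Strip each piece and drop the ones that were only whitespace.
    return " ".join(s.strip() for s in pieces if s.strip())


data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BeautifulSoup(data, 'html.parser')
result = direct_text(soup.find('span'))
print(result)  # I Like your face
```

Note that this drops the text of every nested tag, not only the ones marked 'unwanted', and that comments would be included because Comment is a NavigableString subclass.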

You can easily find the (un)wanted text like this:

from bs4 import BeautifulSoup
text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 <span>"""
soup = BeautifulSoup(text, "lxml")
for i in soup.find_all("span"):
    if 'class' in i.attrs:
        if "unwanted" in i.attrs['class']:
            print(i.text)

From here, outputting everything else is easy.
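
One way that last step might look is sketched below. It reuses the class check from the loop above to collect the unwanted spans, removes them, and prints what is left. Assumptions: the markup is properly closed, and html.parser is used here instead of lxml only so that nothing extra has to be installed. The names unwanted and remaining are made up for the example.

```python
from bs4 import BeautifulSoup

text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(text, "html.parser")

# Collect the unwanted spans first, then remove them, so the tree
# is not modified while it is still being searched.
unwanted = [
    tag for tag in soup.find_all("span")
    if "unwanted" in tag.get("class", [])
]
for tag in unwanted:
    tag.decompose()

# get_text(" ", strip=True) strips each remaining string and joins
# the non-empty ones with a single space.
remaining = soup.get_text(" ", strip=True)
print(remaining)
```

tag.get("class", []) is used so that spans without a class attribute are handled without a separate 'class' in attrs check.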
