Beautiful Soup - get the outer tag's text without the inner tag's text



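I have an HTML file with <li> tags that contain nested <a> tags, and there is text both in the list items and in the <a> tags. I want to extract the two separately: the <li> text should become the value of the key tag, and the text inside the <a> tag should become the value of a key in the children tag. (HTML snippet below.)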

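I eventually write this out to a JSON file, but I am getting undesired results. The main "tag" value should be just "abstract visualization", not all the other text. The children tag should contain only "about", not the "/ Emotive and abstract" that follows it; "Emotive and abstract" already has its rightful place in title. Every entry in this index sample shows the same pattern. How can I extract the text into the right places? I am a beginner with Beautiful Soup. Many thanks.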

JSON file:

{
    "tag": "abstract visualization\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nabout / Emotive and abstract\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "about / Emotive and abstract",
            "definition": "",
            "source": [
                {
                    "title": "Emotive and abstract",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch02s03.html"
                }
            ]
        }
    ]
},
{
    "tag": "Adobe After Effects\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nURL / Other specialist tools\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "URL / Other specialist tools",
            "definition": "",
            "source": [
                {
                    "title": "Other specialist tools",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch06.html"
                }
            ]
        }
    ]
},

HTML snippet:

<ul id="letters">
<li>abstract visualization
<ul>
<li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
</ul>
</li>
<li>Adobe After Effects
<ul>
<li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
</ul>
</li>
<li>Adobe Flash
<ul>
<li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
<li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
</ul>
</li>
<li>Adobe Illustrator
<ul>
<li>about / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
<li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
</ul>
</li>
</ul>

Relevant code:

from bs4 import BeautifulSoup

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': li.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find(id='backindex')
    # iterate over every indented letter in index
    letters = html.find_all(attrs={'id': 'letters'})
    for letter in letters:
        tags += li_parser(letter, link_prefix)
    return tags

# course, link_prefix and add_course_tag are defined elsewhere in the script (not shown)
tags = []
# parse the html
html = bs4_convert(course['file'])
# create tags
tags = html_parser(html, link_prefix)
# add course name as outermost tag
tags = add_course_tag(course['course'], tags)

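A tag's children are available in a list called contents. In your case, the text you are looking for is simply contents[0], which is easier than looping over all the children. You only need strip() to remove the unwanted tabs and newlines.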

from bs4 import BeautifulSoup

# data holds the HTML snippet from the question as a string
soup = BeautifulSoup(data, 'lxml')
lis = soup.select('#letters > li')
for li in lis:
    print(li.contents[0].strip())
    sub_li = li.select_one('ul li')
    print(sub_li.contents[0].strip()[:-2])  # get rid of the trailing slash

Output:

abstract visualization
about
Adobe After Effects
URL
Adobe Flash
about
Adobe Illustrator
about
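
As a small illustration of why contents[0] helps, the sketch below compares it with .text on a trimmed copy of the question's HTML and, unlike select_one, collects every sub-item. It assumes the first child of each <li> is a text node, which holds for this index but not for arbitrary HTML; the names data and index are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML snippet from the question.
data = """
<ul id="letters">
<li>abstract visualization
<ul>
<li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
</ul>
</li>
<li>Adobe Flash
<ul>
<li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
<li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")
top_items = soup.select("#letters > li")

# .text concatenates the text of every descendant, nested <ul> included.
print(repr(top_items[0].text))

index = {}
for li in top_items:
    # contents[0] is only the first child node: the <li>'s own leading text.
    # Assumption: that first child is a text node, as in this index.
    heading = li.contents[0].strip()
    # Visit every sub-item, not just the first one.
    index[heading] = [
        sub.contents[0].strip().rstrip(" /") for sub in li.select("ul > li")
    ]

print(index)
```

Here rstrip(" /") replaces the [:-2] slice, so a sub-item without a trailing slash is left intact.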

To get the correct string for your tag, you can take an approach similar to @diggusbickus's and use stripped_strings, selecting the first element:

'tag': list(li.stripped_strings)[0].strip(' /')
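
To see why the first element is the one to pick, the sketch below lists what stripped_strings yields for a single entry of the question's HTML. The variable names are illustrative, and next() is shown only as an optional alternative to building the full list.

```python
from bs4 import BeautifulSoup

# One top-level entry from the question's HTML snippet.
html = """<li>abstract visualization
<ul>
<li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
</ul>
</li>"""

li = BeautifulSoup(html, "html.parser").li

# Every non-empty string under the <li>, whitespace-stripped, in document order.
strings = list(li.stripped_strings)
print(strings)

# stripped_strings is a generator, so next() fetches the first string
# without materialising the rest.
first = next(li.stripped_strings).strip(" /")
print(first)
```

This relies on the <li>'s own text coming before any nested tag; if an entry started with a nested tag, the first string would belong to that tag instead.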

Example:

def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': list(li.stripped_strings)[0].strip(' /'),
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags

Output:

[{"tag": "abstract visualization", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Emotive and abstract", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch02s03.html"}]}]}, {"tag": "Adobe After Effects", "definition": "", "source": [], "children": [{"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Flash", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Illustrator", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}]
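
For anyone who wants to reproduce this kind of output without the original course files, here is a minimal end-to-end sketch. SAMPLE_HTML is a trimmed copy of the question's snippet and the "https://example.com/" prefix is a placeholder, not the real course URL; the parser is a lightly adapted variant of the function above that uses next() with a default so an empty <li> does not raise.

```python
import json

from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML snippet.
SAMPLE_HTML = """
<ul id="letters">
<li>abstract visualization
<ul>
<li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
</ul>
</li>
</ul>
"""


def li_parser(letter, link_prefix):
    """Build one dict per direct <li> child of `letter`, recursing into nested lists."""
    tags = []
    for li in letter.find_all("li", recursive=False):
        tag = {
            # First non-empty string of the <li>; "" if it has no text at all.
            "tag": next(li.stripped_strings, "").strip(" /"),
            "definition": "",
            # Only links that are direct children of this <li>.
            "source": [
                {"title": link.get_text(strip=True), "href": link_prefix + link["href"]}
                for link in li.find_all("a", recursive=False)
            ],
        }
        child_list = li.find("ul", recursive=False)
        if child_list is not None:
            tag["children"] = li_parser(child_list, link_prefix)
        tags.append(tag)
    return tags


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Placeholder prefix, assumed for illustration only.
tags = li_parser(soup.find(id="letters"), "https://example.com/")
print(json.dumps(tags, indent=2))
```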
