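I have some <li> tags with nested <a> tags in an HTML file, and there is text in both the list items and the <a> tags. However, I would like to extract them separately: the <li> text should become the value of the key tag, and the text inside the <a> tag should become a value within the child tag's entry. (See below for the HTML snippet.)
I eventually write the result to a JSON file, but I get an undesired result. The main tag should contain only "abstract visualization", not all the other stuff. The children tag should contain only "about", not the "/ Emotive and abstract" that follows it; "Emotive and abstract" already has its place in title. You can see that every entry in this index example shows the same pattern. How can I extract the text into the right places? I am a beginner with Beautiful Soup. Thanks a lot.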
JSON file:
{
    "tag": "abstract visualization\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nabout / Emotive and abstract\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "about / Emotive and abstract",
            "definition": "",
            "source": [
                {
                    "title": "Emotive and abstract",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch02s03.html"
                }
            ]
        }
    ]
},
{
    "tag": "Adobe After Effects\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nURL / Other specialist tools\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "URL / Other specialist tools",
            "definition": "",
            "source": [
                {
                    "title": "Other specialist tools",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch06.html"
                }
            ]
        }
    ]
},
HTML snippet:
<ul id="letters">
    <li>abstract visualization
        <ul>
            <li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
        </ul>
    </li>
    <li>Adobe After Effects
        <ul>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
    <li>Adobe Flash
        <ul>
            <li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
            <li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
        </ul>
    </li>
    <li>Adobe Illustrator
        <ul>
            <li>about / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
</ul>
Relevant code:
from bs4 import BeautifulSoup

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': li.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index (note: the result of this lookup is not used)
    html.find(id='backindex')
    # iterate over every indented letter in index
    letters = html.find_all(attrs={'id': 'letters'})
    for letter in letters:
        tags += li_parser(letter, link_prefix)
    return tags

tags = []
# parse the html
html = bs4_convert(course['file'])
# create tags
tags = html_parser(html, link_prefix)
# add course name as outermost tag
tags = add_course_tag(course['course'], tags)
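A tag's children are available in a list called contents. In your case, the text you are looking for is simply contents[0], so this is easier than looping over all the children. You only need strip() to remove the unwanted tabs and newlines.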
from bs4 import BeautifulSoup

# data is assumed to hold the HTML source as a string
soup = BeautifulSoup(data, 'lxml')
lis = soup.select('#letters > li')
for li in lis:
    print(li.contents[0].strip())
    sub_li = li.select_one('ul li')
    print(sub_li.contents[0].strip()[:-2])  # drop the trailing " /"
Output
abstract visualization
about
Adobe After Effects
URL
Adobe Flash
about
Adobe Illustrator
about
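To make the difference between .text and contents[0] concrete, here is a minimal sketch. The SAMPLE string is a hypothetical inline copy of one entry from the question's HTML, and html.parser is used instead of lxml only so that no extra parser needs to be installed. Unlike the loop above, it visits every nested item rather than only the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample mirroring one entry of the question's HTML.
SAMPLE = """
<ul id="letters">
  <li>Adobe Flash
    <ul>
      <li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
      <li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
li = soup.select_one("#letters > li")

# .text concatenates every descendant string, nested list included.
full_text = li.text

# contents[0] is only the first child node: the text before the nested <ul>.
own_text = li.contents[0].strip()

# Visit every nested item, dropping the trailing slash and surrounding spaces.
sub_labels = [
    sub.contents[0].strip().rstrip("/").strip()
    for sub in li.select("ul > li")
]

print(repr(full_text))
print(own_text)
print(sub_labels)
```

This relies on the label being the first child node of each <li>; if an item started with a tag instead of text, contents[0] would be that tag and .strip() would not apply.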
To get the correct string for your tag, you could use the stripped_strings generator and pick its first element, similar to @diggusbickus's approach:
'tag': list(li.stripped_strings)[0].strip(' /')
Example
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': list(li.stripped_strings)[0].strip(' /'),
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags
Output
[{"tag": "abstract visualization", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Emotive and abstract", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch02s03.html"}]}]}, {"tag": "Adobe After Effects", "definition": "", "source": [], "children": [{"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Flash", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Illustrator", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}]
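For anyone who wants to try the parser end to end, the following is a rough self-contained sketch rather than the asker's actual pipeline. SAMPLE and LINK_PREFIX are hypothetical stand-ins for the real file and URL prefix. It also swaps list(...)[0] for next(..., ""), which is an assumption about what should happen for an <li> with no text at all: it yields an empty tag instead of raising IndexError.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample and prefix, for illustration only.
SAMPLE = """
<ul id="letters">
  <li>abstract visualization
    <ul>
      <li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
    </ul>
  </li>
</ul>
"""
LINK_PREFIX = "https://example.com/"


def li_parser(ul, link_prefix):
    """Turn the direct <li> children of ul into nested tag dictionaries."""
    tags = []
    for li in ul.find_all("li", recursive=False):
        # First non-empty string of the item; "" if the item has no text.
        label = next(li.stripped_strings, "").strip(" /")
        tag = {
            "tag": label,
            "definition": "",
            "source": [
                {"title": a.text, "href": link_prefix + a["href"]}
                for a in li.find_all("a", recursive=False)
            ],
        }
        child_ul = li.find("ul")
        if child_ul:
            tag["children"] = li_parser(child_ul, link_prefix)
        tags.append(tag)
    return tags


soup = BeautifulSoup(SAMPLE, "html.parser")
tags = li_parser(soup.find(id="letters"), LINK_PREFIX)
print(json.dumps(tags, indent=2))
```

One caveat with this approach: if an <li> begins directly with an <a>, the first stripped string is the link text, so that text would end up in tag as well as in title.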