Extracting HTML into JSON with Python's BeautifulSoup



Question

I'm trying to parse blocks of HTML and store the relevant data in a JSON object, but I'm struggling to work out how BeautifulSoup's handling of child tags conflicts with my particular needs.

Sample input:

<p>Here's a paragraph</p>
<ul>
    <li>With a list</li>
    <li>
        <ul>
            <li>And a nested list</li>
            <li>Within it that has some <strong>bold text</strong></li>
        </ul>
    </li>
</ul>

Desired output:

[
    {
        "type": "p",
        "content": "Here's a paragraph"
    },
    {
        "type": "ul",
        "content": [
            {
                "type": "li",
                "content": "With a list"
            },
            {
                "type": "li",
                "content": [
                    {
                        "type": "ul",
                        "content": [
                            {
                                "type": "li",
                                "content": "And a nested list"
                            },
                            {
                                "type": "li",
                                "content": "Within it that has some bold text"
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

My attempt

Here's my best attempt so far:

from bs4 import BeautifulSoup
import json

def process(html):
    content = []
    soup = BeautifulSoup(html, 'html.parser')
    elements = soup.descendants
    for element in elements:
        if str(element).strip() not in [' ', '']:
            if element.name in ['p']:  # , 'ul', 'ol', 'li']:
                content.append({
                    'type': element.name,
                    'content': element.find(text=True, recursive=False)
                })
            elif element.name in ['ul', 'ol']:
                parent = {
                    'type': element.name,
                    'content': []
                }
                for child in element.children:
                    if child != '\n':
                        if child.find(text=True, recursive=False) != '\n':
                            parent['content'].append({
                                'type': child.name,
                                'content': child.find(text=True, recursive=False)
                            })
                content.append(parent)
    print(json.dumps(content, indent=4))

if __name__ == '__main__':
    original = '''<p>Here's a paragraph</p>
<ul>
<li>With a list</li>
<li>
<ul>
<li>And a nested list</li>
<li>Within it that has some <strong>bold text</strong></li>
</ul>
</li>
</ul>
'''
    process(original)

It produces the following output:

[
    {
        "type": "p",
        "content": "Here's a paragraph"
    },
    {
        "type": "ul",
        "content": [
            {
                "type": "li",
                "content": "With a list"
            }
        ]
    },
    {
        "type": "ul",
        "content": [
            {
                "type": "li",
                "content": "And a nested list"
            },
            {
                "type": "li",
                "content": "Within it that has some "
            }
        ]
    },
    {
        "type": "ul",
        "content": [
            {
                "type": "li",
                "content": "And a nested list"
            },
            {
                "type": "li",
                "content": "Within it that has some "
            }
        ]
    }
]

As you can see, I have three problems:

  1. The inner list appears twice
  2. The inner list is not nested inside its parent list
  3. The text contained in tags (the <strong>) is lost

I know doing this to HTML is a bit odd, but any suggestions on how to fix these three points?

This isn't a BeautifulSoup solution, but it may be easier with an event-based parser such as lxml.etree.iterparse().

You can register start/end (open-tag/close-tag) events, which is a useful way to handle parent/child nesting.
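To see what that event stream looks like, here is a minimal sketch on a tiny fragment (note that lxml's HTML parser wraps the fragment in <html><body>, which is why those tags show up first):

```python
import io
import lxml.etree

# Collect the (event, tag-name) pairs iterparse emits for a tiny fragment.
# 'start' fires when an opening tag is parsed, 'end' when it is closed,
# so nesting depth can be tracked with a simple stack.
src = io.BytesIO(b'<ul><li>a</li></ul>')
events = [(event, el.tag)
          for event, el in lxml.etree.iterparse(
              src, events=('start', 'end'), html=True)]
```

Here `events` starts with `('start', 'html')` and `('start', 'body')` from the implicit wrapping, and every `('start', tag)` precedes its matching `('end', tag)`.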

import io, json, lxml.etree

def process(html):
    # convert html str into fileobj for iterparse
    html = io.BytesIO(html.encode('utf-8'))
    parser = lxml.etree.iterparse(
        html, events=('start', 'end'), html=True)
    root = None
    parents = []
    for event, tag in parser:
        if event == 'start':
            content = []
            if tag.text and tag.text.strip():
                content.append(tag.text.strip())
            child = dict(type=tag.tag, content=content)
            parents.append(child)
            if not root:
                root = child
        else:
            # close </tag> - point child to parent
            if len(parents) > 1:
                parent, child = parents[-2:]
                parent['content'].append(child)
            child = parents.pop()
            content = child['content']
            # unwrap 1 element lists that contain a text only node
            if len(content) == 1 and isinstance(content[0], str):
                child['content'] = content.pop()
                # If the previous element is also a text only node
                # join text together and "discard" the "dict"
                if len(parent['content']) > 1 and \
                        isinstance(parent['content'][-2], str):
                    parent['content'][-2] += ' ' + child['content']
                    parent['content'].pop()
    # root = root['content'][0]['content']
    print(json.dumps(root, indent=4))

iterparse adds <html><body> tags - if you want to exclude them you can use root = root['content'][0]['content'] or similar.

Output:

{
    "type": "html",
    "content": [
        {
            "type": "body",
            "content": [
                {
                    "type": "p",
                    "content": "Here's a paragraph"
                },
                {
                    "type": "ul",
                    "content": [
                        {
                            "type": "li",
                            "content": "With a list"
                        },
                        {
                            "type": "li",
                            "content": [
                                {
                                    "type": "ul",
                                    "content": [
                                        {
                                            "type": "li",
                                            "content": "And a nested list"
                                        },
                                        {
                                            "type": "li",
                                            "content": "Within it that has some bold text"
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
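For comparison, the three issues can also be fixed on the BeautifulSoup side by recursing over a tag's direct children instead of scanning soup.descendants (which visits every nested tag again, hence the duplicate inner list). This is a sketch of that idea, not the original attempt - the block_tags list and to_json helper are names I've made up here:

```python
from bs4 import BeautifulSoup

block_tags = ['p', 'ul', 'ol', 'li']

def to_json(node):
    out = []
    for child in node.children:
        if child.name in block_tags:
            if child.find(block_tags):
                # has block-level descendants: recurse so nesting is kept
                out.append({'type': child.name, 'content': to_json(child)})
            else:
                # leaf: get_text() keeps text inside inline tags like <strong>
                out.append({'type': child.name,
                            'content': child.get_text(' ', strip=True)})
    return out

html = '''<p>Here's a paragraph</p>
<ul>
<li>With a list</li>
<li>
<ul>
<li>And a nested list</li>
<li>Within it that has some <strong>bold text</strong></li>
</ul>
</li>
</ul>'''
result = to_json(BeautifulSoup(html, 'html.parser'))
```

Because each tag is visited exactly once and only through its parent, the inner list appears once, stays nested, and get_text() merges the <strong> text back into its li - matching the desired output for this sample without any post-processing.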
