How to extract specific items from HTML using bs4



I have the following HTML and want to convert it:

<div class="company_data__list">
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">ABC Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">230000</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">103</div></div>
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">XYZ Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">10</div></div>
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">CAT Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">430000</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">10233</div></div>
<div class="company_data__row"><div class="company_data__head">URL</div><div class="company_data__data">www.abc.com</div></div>
</div>

into a JSON file like this:

{ id: '1',
data:{
name: 'ABC CAT Company',
capital: '230000',
total:'103'
},
id:'2',
data: {
name: 'XYZ CAT Company',
total:'10'
},
id:'3',
data: {
name: 'CAT Company',
capital: '430000',
total:'10',
url:'www.abc.com'
},

}

I am using Python 3, bs4, and re (regular expressions).

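Here is one approach. It writes each company as one row of a CSV file and assumes every company has exactly three rows (Name, Capital, Total).

For example: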

import csv
from bs4 import BeautifulSoup
html = """<div class="company_data__list">
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">ABC Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">230000</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">103</div></div>
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">XYZ Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">330000</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">10</div></div>
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">CAT Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">430000</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">10233</div></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="company_data__list").find_all("div", class_='company_data__data') #Find required DIV
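filename = "output.csv"  # Hypothetical output path; change as needed
with open(filename, "w", newline="") as csv_file:  # Open the output file
    writer = csv.writer(csv_file)  # Create the CSV writer
    for i in range(0, len(content), 3):  # Each company spans 3 consecutive rows
        temp = [j.text for j in content[i:i+3]]  # Name, Capital, Total
        writer.writerow(temp)  # Write one company per CSV row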
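
The snippet above writes CSV and relies on every company having exactly three fields, while the question asks for JSON and has companies with differing fields (a missing Capital, an extra URL). A minimal sketch of one way to handle that is below. It assumes that every "Name" row starts a new company, that only the text before the `<br/>` is wanted as the name, and that a list of objects is an acceptable JSON shape, since repeated `id`/`data` keys in a single object are not valid JSON. The helper name `rows_to_records` and the shortened sample HTML are illustrative, not from the original post.

```python
import json
from bs4 import BeautifulSoup

# Shortened sample with uneven field counts per company (illustrative data).
html = """<div class="company_data__list">
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">ABC Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Capital</div><div class="company_data__data">230000</div></div>
<div class="company_data__row"><div class="company_data__head">Name</div><div class="company_data__data">XYZ Company<br/>Subtitle</div></div>
<div class="company_data__row"><div class="company_data__head">Total</div><div class="company_data__data">10</div></div>
<div class="company_data__row"><div class="company_data__head">URL</div><div class="company_data__data">www.abc.com</div></div>
</div>"""


def rows_to_records(html_text):
    """Group head/data rows into one record per company (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for row in soup.find_all("div", class_="company_data__row"):
        head = row.find("div", class_="company_data__head")
        data = row.find("div", class_="company_data__data")
        if head is None or data is None:
            continue  # Skip rows that lack a head or data cell
        key = head.get_text(strip=True).lower()
        # Assumption: keep only the text before the first <br/>, dropping the subtitle
        value = next(data.stripped_strings, "")
        # Assumption: a "Name" row marks the start of a new company
        if key == "name" or not records:
            records.append({"id": str(len(records) + 1), "data": {}})
        records[-1]["data"][key] = value
    return records


records = rows_to_records(html)
print(json.dumps(records, indent=2))
```

To save the result to a file instead of printing it, `json.dump(records, f, indent=2)` on a file opened for writing should work. Grouping on the "Name" row rather than on a fixed stride of three is what lets companies carry different sets of fields.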
