Get specific strings inside a <div> tag using Beautiful Soup

I have a list of tags that I extracted with:

soup.findAll('div', {'class': 'formelement'})

The output is:

[<div class="formelement">
<label class="libelle" for="field_tit">Etat :</label>
Publié              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Type de produit :</label>
Plaque de plâtre                </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Numéro :</label>
PP/48-05                                    </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Titulaire :</label>
CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
<div class="formelement">
<label class="libelle" for="field_ref">Usine :</label>
39              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date d'admission :</label>
13/07/2017                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de reconduction :</label>
04/02/2021                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de fin de validité :</label>
04/05/2022                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Certificat PDF :</label>
<a href="application/docs/certificats/PP_48_05.pdf" target="_blank"> <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/> </a> </div>]

My goal is to have a dictionary:

product_data = {
    "Numéro": "PP/48-05",
    "Titulaire": "CIA ESPAÑOLA DE AISLAMIENTOS SA",
    "Usine": "39",
    "Date de fin de validité": "04/05/2022",
    "Certificat PDF": "application/docs/certificats/PP_48_05.pdf"
}

I tried with

for div in soup.findAll('div', {'class': 'formelement'}):
    product_data[div.text] = div.next_sibling

but it takes all the strings inside the tag (obviously), and I could not find a way to get the two strings inside each div separately. How can I get these strings separately?

I hope my question is clear enough.

You can decompose (destroy) the inner label tag and then read what is left in the div:

from bs4 import BeautifulSoup
html="""
<div class="formelement">
<label class="libelle" for="field_tit">Etat :</label>
Publié              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Type de produit :</label>
Plaque de plâtre                </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Numéro :</label>
PP/48-05                                    </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Titulaire :</label>
CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
<div class="formelement">
<label class="libelle" for="field_ref">Usine :</label>
39              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date d'admission :</label>
13/07/2017                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de reconduction :</label>
04/02/2021                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de fin de validité :</label>
04/05/2022                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Certificat PDF :</label>
<a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
<img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
</a>
</div>"""
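soup = BeautifulSoup(html, 'html.parser')
data = {}
for div in soup.find_all('div', {'class': 'formelement'}):
    label = div.find('label')
    key = label.text[:-2]  # drop the trailing " :"
    label.decompose()  # remove the label so only the value stays in the div
    try:
        # the PDF row holds a link instead of plain text
        value = div.find('a').get('href')
    except AttributeError:
        # find() returned None (no <a>), so use the remaining text
        value = div.text.strip()
    data[key] = value
print(data)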

Output:
{'Etat': 'Publié', 'Type de produit': 'Plaque de plâtre',
'Numéro': 'PP/48-05', 'Titulaire': 'CIA ESPAÑOLA DE AISLAMIENTOS SA', 
'Usine': '39', "Date d'admission": '13/07/2017', 
'Date de reconduction': '04/02/2021', 'Date de fin de validité': '04/05/2022', 
'Certificat PDF': 'application/docs/certificats/PP_48_05.pdf'}
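
If you would rather not modify the parsed tree, a similar result can probably be reached by reading the text node that follows each label. The following is only a minimal sketch under assumptions: the helper name `extract_fields` is hypothetical, the sample markup is a shortened copy of the question's, and it assumes the value is either a link inside the div or the text node placed directly after the label.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the markup from the question.
html = """
<div class="formelement">
<label class="libelle" for="field_tit">Numéro :</label>
PP/48-05                                    </div>
<div class="formelement">
<label class="libelle" for="field_tit">Certificat PDF :</label>
<a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
<img src="public/images/pdf.gif"/>
</a>
</div>
"""


def extract_fields(markup):
    """Hypothetical helper: map each label to its value without editing the tree."""
    soup = BeautifulSoup(markup, "html.parser")
    fields = {}
    for div in soup.find_all("div", class_="formelement"):
        label = div.find("label")
        if label is None:
            continue  # skip blocks that have no label
        key = label.get_text(strip=True).rstrip(" :")
        link = div.find("a")
        if link is not None and link.has_attr("href"):
            # Rows such as the PDF one carry their value in the link target.
            value = link["href"]
        else:
            # Assumption: the value is the text node right after the label.
            value = str(label.next_sibling or "").strip()
        fields[key] = value
    return fields


result = extract_fields(html)
print(result)
```

Because nothing is decomposed, the same `soup` can still be queried for the labels afterwards, which may matter if other code reuses it.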

Try:

from bs4 import BeautifulSoup
html_doc = """
<div class="formelement">
<label class="libelle" for="field_tit">Etat :</label>
Publié              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Type de produit :</label>
Plaque de plâtre                </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Numéro :</label>
PP/48-05                                    </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Titulaire :</label>
CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
<div class="formelement">
<label class="libelle" for="field_ref">Usine :</label>
39              </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date d'admission :</label>
13/07/2017                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de reconduction :</label>
04/02/2021                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Date de fin de validité :</label>
04/05/2022                      </div>,
<div class="formelement">
<label class="libelle" for="field_tit">Certificat PDF :</label>
<a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
<img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
allowed_keys = {
    "Numéro",
    "Titulaire",
    "Usine",
    "Date de fin de validité",
    "Certificat PDF",
}
data = []
for f in soup.select(".formelement"):
    # join the label text and the value text with "|" and split them again
    key_value = f.get_text(strip=True, separator="|").split("|")
    if len(key_value) == 1:
        # only the label has text: take the value from the link, if any
        a = f.find("a")
        if a:
            key_value = [key_value[0], a["href"]]
        else:
            continue
    key_value[0] = key_value[0].strip(" :")
    if key_value[0] not in allowed_keys:
        continue
    data.append(key_value)

out = dict(data)
print(out)

Prints:

{
"Numéro": "PP/48-05",
"Titulaire": "CIA ESPAÑOLA DE AISLAMIENTOS SA",
"Usine": "39",
"Date de fin de validité": "04/05/2022",
"Certificat PDF": "application/docs/certificats/PP_48_05.pdf",
}
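
One thing to keep in mind with the separator approach is that a value containing "|" would be split into extra pieces. A possible variant, shown here only as a sketch, iterates over `stripped_strings` instead, which yields each non-empty text fragment already stripped. The function name `pick_fields` and the shortened sample markup are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the markup from the question.
html_doc = """
<div class="formelement">
<label class="libelle" for="field_tit">Etat :</label>
Publié              </div>
<div class="formelement">
<label class="libelle" for="field_ref">Usine :</label>
39              </div>
<div class="formelement">
<label class="libelle" for="field_tit">Certificat PDF :</label>
<a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
<img src="public/images/pdf.gif"/>
</a>
</div>
"""


def pick_fields(markup, allowed_keys):
    """Hypothetical helper: keep only the allowed labels and their values."""
    soup = BeautifulSoup(markup, "html.parser")
    out = {}
    for block in soup.select(".formelement"):
        # stripped_strings skips whitespace-only text nodes.
        parts = list(block.stripped_strings)
        if not parts:
            continue
        key = parts[0].strip(" :")
        if key not in allowed_keys:
            continue
        link = block.find("a")
        if len(parts) > 1:
            out[key] = parts[1]  # plain text value after the label
        elif link is not None and link.has_attr("href"):
            out[key] = link["href"]  # value stored in the link target
    return out


picked = pick_fields(html_doc, {"Usine", "Certificat PDF"})
print(picked)
```

Here "Etat" is dropped because it is not in the allowed set, while the PDF row falls back to the link's `href`.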
