Python - Converting text from an HTML file to CSV when the tags have no unique identifiers



I used beautifulsoup4 to scrape a web page that lists details of psychiatrists' practices, and managed to pull out this section, which contains the key information:

<h5>Practice Locations</h5>
<p>Springfield, 1234<br/> 08 1234 5678</p>
<p>Shelbyville, 1234<br/>08 1234 5678</p>
<h5>Gender:</h5>
<p>Male<br/></p>
<h5>Languages spoken (other than English):</h5>
<p>Spanish<br/></p>
<p>Italian<br/></p>
<h5>Problem areas treated:</h5>
<p>Anxiety disorders<br/>Mood disorders<br/>Sexual disorders<br/></p>
<h5>Populations treated:</h5>
<p>Adult<br/>Young adult<br/></p>
<h5>Subspecialty areas:</h5>
<p>Cancer patients<br/>Gender issues<br/>Pain management<br/>Specialist psychotherapist<br/></p>
<h5>Treatments and services offered:</h5>
<p>Does not prescribe psychotropics<br/>Psychotherapy – cognitive behavioural therapy (CBT)<br/>Psychotherapy – hypnotherapy<br/>Psychotherapy – interpersonal<br/>Psychotherapy – marital therapy<br/></p>
<h5>Practice details:</h5>
<p>Can bulk bill selected patients<br/></p>
<p> </p>

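I want to put the information under each heading into its own column of a .csv file, but I can't work out how to do it because the headings have no unique identifiers. I know I have to use the headings somehow to delimit the separate columns, but I'm completely new to Python and don't know how.

Doing this by hand would be easy, but I want to collect the information from many pages that are formatted the same way. To complicate matters, some pages are missing the information for some of these headings (for example, they don't list populations treated or subspecialty areas), so I have to check whether each heading exists before trying to collect its information.

Any guidance would be much appreciated!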

You can use the h5 tags as headers:

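import itertools
from bs4 import BeautifulSoup as soup

# `content` is assumed to hold the HTML snippet above as a string.
parsed = soup(content, 'html.parser')
headers = [i.text for i in parsed.find_all('h5')]
# Every <h5> and <p> in document order, as [text, tag] pairs.
full_data = [[i.text, i] for i in parsed.find_all(['h5', 'p'])]
# Split into alternating runs: one header, then the paragraphs under it.
new_data = [[a, list(b)] for a, b in itertools.groupby(full_data, key=lambda x: x[0] in headers)]
grouped = [new_data[i] + new_data[i+1] for i in range(0, len(new_data), 2)]
# header -> {paragraph text: lines after the first <br/>}; the first line survives only in the key.
final_data = {c: {i: str(h)[3:-4].split('<br/>')[1:] for i, h in results} for [_, [[c, _]], _, results] in grouped}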

Output:

{'Practice Locations': {'Springfield, 1234 08 1234 5678': [' 08 1234 5678'], 'Shelbyville, 123408 1234 5678': ['08 1234 5678']}, 'Gender:': {'Male': ['']}, 'Languages spoken (other than English):': {'Spanish': [''], 'Italian': ['']}, 'Problem areas treated:': {'Anxiety disordersMood disordersSexual disorders': ['Mood disorders', 'Sexual disorders', '']}, 'Populations treated:': {'AdultYoung adult': ['Young adult', '']}, 'Subspecialty areas:': {'Cancer patientsGender issuesPain managementSpecialist psychotherapist': ['Gender issues', 'Pain management', 'Specialist psychotherapist', '']}, 'Treatments and services offered:': {'Does not prescribe psychotropicsPsychotherapy – cognitive behavioural therapy (CBT)Psychotherapy – hypnotherapyPsychotherapy – interpersonalPsychotherapy – marital therapy': ['Psychotherapy – cognitive behavioural therapy (CBT)', 'Psychotherapy – hypnotherapy', 'Psychotherapy – interpersonal', 'Psychotherapy – marital therapy', '']}, 'Practice details:': {'Can bulk bill selected patients': [''], ' ': []}}
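
To get from there to the .csv the question asks for, one option is to walk the h5 and p tags in order, collect the text under each heading, and let csv.DictWriter leave a blank cell for any heading a page lacks. The following is only a sketch under some assumptions: the names COLUMNS, parse_page and write_rows are made up for illustration, the column list is simply the headings from the sample page with the trailing colon removed, every value is assumed to sit in p tags that follow its h5, and joining several values into one cell with "; " is assumed to be acceptable.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical column list: the headings from the sample page, minus the trailing colon.
COLUMNS = [
    "Practice Locations",
    "Gender",
    "Languages spoken (other than English)",
    "Problem areas treated",
    "Populations treated",
    "Subspecialty areas",
    "Treatments and services offered",
    "Practice details",
]


def parse_page(html):
    """Return {heading: [text pieces]} for one page, in document order."""
    page = BeautifulSoup(html, "html.parser")
    record = {}
    current = None
    for tag in page.find_all(["h5", "p"]):
        if tag.name == "h5":
            # A new heading starts a new (initially empty) list of values.
            current = tag.get_text(strip=True).rstrip(":")
            record[current] = []
        elif current is not None:
            # stripped_strings yields each <br/>-separated piece and
            # skips whitespace-only text such as "<p> </p>".
            record[current].extend(tag.stripped_strings)
    return record


def write_rows(pages, fh):
    """Write one CSV row per HTML page to the open text file `fh`."""
    # restval fills headings a page lacks; extrasaction drops unknown headings.
    writer = csv.DictWriter(fh, fieldnames=COLUMNS, restval="", extrasaction="ignore")
    writer.writeheader()
    for html in pages:
        record = parse_page(html)
        writer.writerow({key: "; ".join(values) for key, values in record.items()})
```

To use it, open the output file with newline="" (as the csv module documentation recommends) and pass write_rows a list of HTML strings, one per scraped page. Because the columns are fixed up front, there is no need to check each heading before reading it: a page without "Populations treated" just gets an empty cell in that column.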
