OK, here is the data I want to scrape:
<li class="expandable"> Criminal
<ul class="subPracticeAreas" style="display:none">
<li> Appellate</li>
<li>Crimes against the person</li>
<li> Drugs</li>
<li>Environmental and planning offences</li>
<li> Extradition</li>
<li>Fraud</li>
<li> Juvenile justice</li>
<li>Mental illness</li>
<li> Proceeds of crime / money laundering</li>
<li>Property offences</li>
<li> Sexual assault</li>
<li>Traffic</li>
<li> White collar and corporate crime</li>
<li>Work health and safety</li>
</ul>
</li>
<li class="expandable"> Appellate
<ul class="subPracticeAreas" style="display:none">
<li> Civil appeals</li>
<li>Criminal appeals</li>
</ul>
</li>
<li class="expandable"> Inquests / inquiries
<ul class="subPracticeAreas" style="display:none">
<li> Commissions and other Inquiries</li>
<li>Coronial inquests</li>
</ul>
</li>
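So I would like to achieve the following:
- Get the text of the parent and store it as a variable (to be used as a dictionary key); e.g. in the first list, I only want to grab "Criminal".
- Grab the text of each child (individually, of course) and store it as a value under the key "Criminal" (as above).
Rinse and repeat for each li class="expandable" section.
What I have so far (as you can imagine, this does not work yet):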
aop_list_headers = page_soup.findAll("li", {"class": "expandable"})
for aop_li_head in aop_list_headers:
    aop_key_name = aop_li_head.getText().strip()
This returns all of the text inside the respective parent li (e.g. for the first iteration of the loop above, I get the following):
CriminalAppellateCrimes against the personDrugsEnvironmental and planning offencesExtraditionFraudJuvenile justiceMental illnessProceeds of crime/money launderingProperty offencesSexual assaultTrafficWhite collar and corporate crimeWork health and safety
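How do I stop it from traversing the text of every li? (I can see why it is happening: the parent li wraps the entire list...)
I have not included how I would go about the second goal (above), because I am stuck on the first one....
All help is much appreciated. Thank you.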
You can use the recursive flag with find_all to access all the child elements for the intended dict keys:
children = soup.find_all("li", {"class": "expandable"}, recursive=False)
for child in children:
    print(child.getText())
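To make the effect of the recursive flag concrete, here is a minimal sketch. The outer `<ul id="areas">` wrapper and the variable names are assumptions made for illustration; they are not part of the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper <ul id="areas"> added so there is a clear parent to search from.
html = """
<ul id="areas">
<li class="expandable">Appellate
<ul class="subPracticeAreas">
<li>Civil appeals</li>
<li>Criminal appeals</li>
</ul>
</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul", id="areas")

# Default (recursive=True): every descendant <li>, at any depth.
all_items = outer.find_all("li")
# recursive=False: only <li> tags that are direct children of `outer`.
top_items = outer.find_all("li", recursive=False)

print(len(all_items))  # 3
print(len(top_items))  # 1
```

Note that recursive only limits which tags are matched. Calling getText() on a matched parent li still returns the text of everything nested inside it.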
Alternatively, you can get the text of all li elements whose parent's (ul) parent has the class "expandable":
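def get_children(tag):
    # Match <li> tags whose grandparent is an <li class="expandable">
    return (tag.name == 'li' and
            tag.parent.parent is not None and
            tag.parent.parent.name == 'li' and
            'expandable' in tag.parent.parent.get('class', []))

for child in soup.find_all(get_children):
    print(child.getText())  # li text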
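Putting both goals from the question together, one possible sketch is shown below. It reads the parent's own text node for the key and collects the nested li texts as a list, without modifying the tree. The name `practice_areas` and the shortened HTML sample are assumptions for illustration, and it assumes each expandable li starts with its heading text and contains a subPracticeAreas list.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the HTML in the question.
html = """
<li class="expandable"> Appellate
<ul class="subPracticeAreas" style="display:none">
<li> Civil appeals</li>
<li>Criminal appeals</li>
</ul>
</li>
<li class="expandable"> Inquests / inquiries
<ul class="subPracticeAreas" style="display:none">
<li> Commissions and other Inquiries</li>
<li>Coronial inquests</li>
</ul>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

practice_areas = {}  # hypothetical name for the result dict
for parent in soup.find_all("li", class_="expandable"):
    # First text node that is a direct child of the parent <li>,
    # i.e. the heading that precedes the nested <ul>.
    key = parent.find(string=True, recursive=False).strip()
    sub_list = parent.find("ul", class_="subPracticeAreas")
    # One list entry per nested <li>.
    practice_areas[key] = [li.get_text(strip=True) for li in sub_list.find_all("li")]

print(practice_areas)
```

Storing the children as a list under each key is a design choice made here for brevity; numbered keys would work just as well.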
I ended up using the extract() function, like so:
for html in html_list:
    # Storing the unwanted child element
    unwanted = html.find("ul", {"class": "subPracticeAreas"})
    # Extracting the child <ul> data
    unwanted.extract()
This turns:
<li class="expandable"> Criminal
<ul class="subPracticeAreas" style="display:none">
<li> Appellate</li>
<li>Crimes against the person</li>
<li> Drugs</li>
<li>Environmental and planning offences</li>
<li> Extradition</li>
<li>Fraud</li>
<li> Juvenile justice</li>
<li>Mental illness</li>
<li> Proceeds of crime / money laundering</li>
<li>Property offences</li>
<li> Sexual assault</li>
<li>Traffic</li>
<li> White collar and corporate crime</li>
<li>Work health and safety</li>
</ul>
</li>
into this:
<li class="expandable"> Criminal </li>
That leaves just the parent text I need to collect.
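As a small self-contained check of the extract() step, the sketch below runs it on a single li from the question's HTML. The variable names `removed`, `parent_text`, and `child_texts` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """<li class="expandable"> Appellate
<ul class="subPracticeAreas" style="display:none">
<li> Civil appeals</li>
<li>Criminal appeals</li>
</ul>
</li>"""
soup = BeautifulSoup(html, "html.parser")
parent = soup.find("li", {"class": "expandable"})

# extract() removes the tag from the tree and returns it,
# so the children are still available afterwards.
removed = parent.find("ul", {"class": "subPracticeAreas"}).extract()

parent_text = parent.get_text().strip()
child_texts = [li.get_text().strip() for li in removed.find_all("li")]
print(parent_text)   # Appellate
print(child_texts)   # ['Civil appeals', 'Criminal appeals']
```

Keep in mind that extract() modifies the parsed tree in place, so any later search on the same soup will no longer find the removed ul.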
To complete the two tasks mentioned in my original question, I used the following code.
import re

# page_soup, aop and page_results are defined earlier in the script
aop_find = page_soup.find(string=re.compile('.*{0}.*'.format(aop)), recursive=True)
if aop_find is not None:
    aop_list_headers = page_soup.findAll("li", {"class": "expandable"})
    # AOP prefix
    aop_prefix = "AOP "
    aop_result = {}
    # Getting the dictionary key
    for aop_li_head in aop_list_headers:
        # Storing the sub practice groups data
        aop_values = aop_li_head.find("ul", {"class": "subPracticeAreas"})
        # Extracting the child <ul> so the parent's text no longer includes it
        aop_values.extract()
        # Key name (e.g. "Criminal ")
        aop_key_name = aop_li_head.getText().strip() + " "
        # Counter restarts for each parent
        aop_counter = 1
        # Finding the text in each value (only the <li> tags, not whitespace nodes)
        for aop_value in aop_values.find_all("li"):
            aop_result[aop_prefix + aop_key_name + str(aop_counter)] = aop_value.getText().strip()
            aop_counter += 1
    # Appending loop results
    page_results.append(aop_result)
Thanks for everyone's input!
Cheers