Beautiful Soup: excluding the inner <ul> and <li> tags from .getText() on a parent <li> tag

Okay, this is the data I want to extract:

  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li> Appellate</li>
        <li>Crimes against the person</li>
        <li> Drugs</li>
        <li>Environmental and planning offences</li>
        <li> Extradition</li>
        <li>Fraud</li>
        <li> Juvenile justice</li>
        <li>Mental illness</li>
        <li> Proceeds of crime / money laundering</li>
        <li>Property offences</li>
        <li> Sexual assault</li>
        <li>Traffic</li>
        <li> White collar and corporate crime</li>
        <li>Work health and safety</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
        <li> Civil appeals</li>
        <li>Criminal appeals</li>
    </ul>
  </li>
  <li class="expandable"> Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
        <li> Commissions and other Inquiries</li>
        <li>Coronial inquests</li>
    </ul>
  </li>

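I'd like to achieve the following:

Get the parent's text and store it as a variable (to be used as a dictionary key). For example, in the first list I only want to grab "Criminal".
Grab the text of each child (individually) and store it as an item under the key "Criminal" (as above).

Rinse and repeat for each li class="expandable" section.

What I have so far (which, as you can imagine, isn't working yet):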

aop_list_headers = page_soup.findAll("li",{"class":"expandable"})
for aop_li_head in aop_list_headers:
    aop_key_name = aop_li_head.getText().strip()

This returns all the text of the respective parent li. For example, on the first iteration of the loop above, I get the following:

CriminalAppellateCrimes against the personDrugsEnvironmental and planning offencesExtraditionFraudJuvenile justiceMental illnessProceeds of crime/money launderingProperty offencesSexual assaultTrafficWhite collar and corporate crimeWork health and safety

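How do I stop it from pulling in the text of every child li? I can see why it happens: the parent li wraps the whole list.

I haven't included how I'd go about the second goal (above), as I'm stuck on the first.

All help is much appreciated. Thank you.

You can use the recursive flag of find_all to access the elements that will become the dict keys. With recursive=False, the search is limited to direct children of the tag it is called on: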

children = soup.find_all("li", { "class" : "expandable" }, recursive=False)
for child in children:
    print(child.getText())
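
Note that getText() on a parent still includes the text of everything nested inside it. As a minimal sketch (the trimmed-down markup and the variable names are assumptions for illustration), passing string=True together with recursive=False to find() returns only the text node that sits directly under the parent:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question
html = """
<ul>
  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
      <li>Appellate</li>
      <li>Fraud</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("li", {"class": "expandable"})

# getText() walks every descendant, so the child items are included
full_text = parent.getText(" ", strip=True)

# string=True matches text nodes; recursive=False keeps the search
# to the parent's direct children, so only its own text is found
own_text = parent.find(string=True, recursive=False).strip()

print(full_text)  # Criminal Appellate Fraud
print(own_text)   # Criminal
```

This assumes the parent's own text comes before the nested list, as it does in the question's markup.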

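Alternatively, you can get the text of every li element whose parent's (ul) parent has the class "expandable":

def get_children(tag):
    # Match <li> tags whose grandparent is an <li class="expandable">
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (tag.name == 'li' and
            grandparent is not None and
            grandparent.name == 'li' and
            'expandable' in grandparent.get('class', []))

for child in soup.find_all(get_children):
    print(child.getText())  # li text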
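
The same parent/grandparent relationship can also be written as a CSS selector. This is an alternative to the filter function, not what the answer used; it is a sketch that assumes a Beautiful Soup version where select() is available, and the sample markup is trimmed from the question:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="expandable"> Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
      <li> Commissions and other Inquiries</li>
      <li>Coronial inquests</li>
    </ul>
  </li>
  <li>Not expandable</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" is the CSS child combinator: each step must be a direct child,
# mirroring the parent-of-parent check in the filter function
children = soup.select("li.expandable > ul > li")
child_texts = [li.getText().strip() for li in children]

print(child_texts)
```

The plain li at the end is skipped because it is not nested under an li with the class "expandable".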

I ended up using the extract() function, like so:

for html in html_list:
    # Storing the unwanted child element
    unwanted = html.find("ul",{"class":"subPracticeAreas"})
    # Extracting the child <ul> data
    unwanted.extract()

This turns:

<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li> Appellate</li>
    <li>Crimes against the person</li>
    <li> Drugs</li>
    <li>Environmental and planning offences</li>
    <li> Extradition</li>
    <li>Fraud</li>
    <li> Juvenile justice</li>
    <li>Mental illness</li>
    <li> Proceeds of crime / money laundering</li>
    <li>Property offences</li>
    <li> Sexual assault</li>
    <li>Traffic</li>
    <li> White collar and corporate crime</li>
    <li>Work health and safety</li>
  </ul>
</li>

into this:

  <li class="expandable"> Criminal </li>

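leaving only the parent, which is what I need to collect.

To complete the two tasks mentioned in my original post, I used the following code.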
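
One detail worth knowing about extract(): it returns the tag it removes, so the child list can still be read after it has been detached. A minimal sketch on trimmed-down markup (the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = """
<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Appellate</li>
    <li>Fraud</li>
  </ul>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("li", {"class": "expandable"})

# extract() detaches the <ul> from the tree and returns it
removed = parent.find("ul", {"class": "subPracticeAreas"}).extract()

# With the <ul> gone, the parent's text is just its own label
parent_text = parent.getText().strip()
# The detached <ul> can still be searched
child_texts = [li.getText().strip() for li in removed.find_all("li")]

print(parent_text)        # Criminal
print(child_texts)        # ['Appellate', 'Fraud']
print(parent.find("ul"))  # None
```

Because extract() modifies the soup in place, any later search of the same soup will no longer see the removed list.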

import re

# aop, page_soup and page_results are assumed to be defined elsewhere in the script
aop_find = page_soup.find(string=re.compile('.*{0}.*'.format(aop)))
if aop_find is not None:
    aop_list_headers = page_soup.findAll("li", {"class": "expandable"})
    # AOP prefix
    aop_prefix = "AOP "
    aop_result = {}
    for aop_li_head in aop_list_headers:
        # Remove the child <ul> from the parent, keeping it for its values
        aop_values = aop_li_head.find("ul", {"class": "subPracticeAreas"})
        aop_values.extract()
        # Key name (e.g. "Criminal ")
        aop_key_name = aop_li_head.getText().strip() + " "
        # Counter restarts for each parent
        aop_counter = 1
        # Store the text of each child <li>, skipping whitespace between tags
        for aop_value in aop_values.find_all("li"):
            aop_result[aop_prefix + aop_key_name + str(aop_counter)] = aop_value.getText().strip()
            aop_counter = aop_counter + 1
    # Appending loop results
    page_results.append(aop_result)
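
For comparison, here is a sketch of the two goals from the question using a different layout: one key per parent, mapped to a list of its children. It does not modify the soup. build_area_dict is a hypothetical helper and the markup is trimmed from the question; this is not the code used above.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
      <li> Appellate</li>
      <li>Fraud</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
      <li> Civil appeals</li>
      <li>Criminal appeals</li>
    </ul>
  </li>
</ul>
"""


def build_area_dict(soup):
    """Hypothetical helper: map each parent's own text to its child texts."""
    result = {}
    for parent in soup.find_all("li", {"class": "expandable"}):
        # Join only the text nodes that sit directly under the parent <li>
        own_strings = parent.find_all(string=True, recursive=False)
        key = "".join(own_strings).strip()
        sub_list = parent.find("ul", {"class": "subPracticeAreas"})
        if sub_list is None:
            # Parent without a nested list: keep the key, no values
            result[key] = []
            continue
        items = sub_list.find_all("li", recursive=False)
        result[key] = [li.getText().strip() for li in items]
    return result


areas = build_area_dict(BeautifulSoup(html, "html.parser"))
print(areas)
```

A dict of lists avoids the numbered "AOP ... 1" style keys, at the cost of a nested structure if the results later need to be flattened into rows.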

Thanks for everyone's input!

Cheers
