Beautiful Soup: excluding the inner <ul> and <li> tags from the parent <li> tag's .getText()



OK, here is the data I am trying to extract:

  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li> Appellate</li>
        <li>Crimes against the person</li>
        <li> Drugs</li>
        <li>Environmental and planning offences</li>
        <li> Extradition</li>
        <li>Fraud</li>
        <li> Juvenile justice</li>
        <li>Mental illness</li>
        <li> Proceeds of crime / money laundering</li>
        <li>Property offences</li>
        <li> Sexual assault</li>
        <li>Traffic</li>
        <li> White collar and corporate crime</li>
        <li>Work health and safety</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
        <li> Civil appeals</li>
        <li>Criminal appeals</li>
    </ul>
  </li>
  <li class="expandable"> Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
        <li> Commissions and other Inquiries</li>
        <li>Coronial inquests</li>
    </ul>
  </li>

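These are the goals I want to achieve:

  1. Get the parent's text and store it in a variable (to use as a dictionary key); e.g. for the first list I only want to grab "Criminal".
  2. Grab the text of each child (individually, of course) and store it as an item under the key "Criminal" (as above).

Rinse and repeat for each li class="expandable" section.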

What I have so far (which, as you can imagine, is not working yet):

aop_list_headers = page_soup.findAll("li",{"class":"expandable"})
for aop_li_head in aop_list_headers:
    aop_key_name = aop_li_head.getText().strip()

This returns all of the text inside the respective parent li; e.g. for the first iteration of the loop above I get the following:

CriminalAppellateCrimes against the personDrugsEnvironmental and planning offencesExtraditionFraudJuvenile justiceMental illnessProceeds of crime/money launderingProperty offencesSexual assaultTrafficWhite collar and corporate crimeWork health and safety

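How do I stop it from walking through the text of every nested li? (I can see why it happens: the parent li wraps the whole list.)

I haven't included how I would go about the second goal (above), since I'm stuck on the first one.

All help is greatly appreciated. Thank you.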

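You can access the elements for the intended dict keys with find_all and its recursive argument. With recursive=False, find_all only considers direct children of the tag it is called on: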

children = soup.find_all("li", { "class" : "expandable" }, recursive=False)
for child in children:
    print(child.getText())
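The same recursive=False idea can also be applied to text nodes, which gives the parent's own text without touching the nested list. The following is only a sketch under assumptions: the sample HTML is a shortened copy of the question's markup wrapped in a hypothetical outer <ul>, and own_text is a helper name made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question (outer <ul> is assumed).
html = """
<ul>
  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li>Fraud</li>
        <li>Traffic</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
        <li>Civil appeals</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Join only the text nodes that are direct children of `tag`."""
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


# One key per expandable <li>, ignoring the text of the nested <ul>.
keys = [own_text(li) for li in soup.find_all("li", class_="expandable")]
print(keys)
```

Because string=True matches text nodes and recursive=False restricts the search to direct children, the text of the nested li elements is never visited.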

Alternatively, you can get the text of every li element whose parent's (ul) parent has the class "expandable":

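def get_children(tag):
    # Match an <li> whose grandparent is an <li class="expandable">
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (tag.name == 'li' and
        grandparent is not None and
        grandparent.name == 'li' and
        'expandable' in grandparent.get('class', []))
for child in soup.find_all(get_children):
    print(child.getText()) #li text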
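To go from the matched children to the dictionary the question asks for, one option is to group each child under its grandparent's own text. This is a sketch, not the answerer's code: is_sub_item and result are illustrative names, and the sample HTML is a shortened copy of the question's markup wrapped in an assumed outer <ul>.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question (outer <ul> is assumed).
html = """
<ul>
  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li> Appellate</li>
        <li>Fraud</li>
    </ul>
  </li>
  <li class="expandable"> Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
        <li>Coronial inquests</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def is_sub_item(tag):
    """True for an <li> whose grandparent is an <li class="expandable">."""
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (tag.name == "li"
            and grandparent is not None
            and grandparent.name == "li"
            and "expandable" in grandparent.get("class", []))


result = {}
for child in soup.find_all(is_sub_item):
    grandparent = child.parent.parent
    # The first direct text node of the grandparent is the heading text.
    key = grandparent.find(string=True, recursive=False).strip()
    result.setdefault(key, []).append(child.get_text(strip=True))

print(result)
```

This leaves the parsed tree unmodified, which may matter if the same soup is reused later.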

I ended up using the extract() function, like so:

for html in html_list:
    # Storing the unwanted child element
    unwanted = html.find("ul",{"class":"subPracticeAreas"})
    # Extracting the child <ul> data
    unwanted.extract()

This turns:

<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li> Appellate</li>
    <li>Crimes against the person</li>
    <li> Drugs</li>
    <li>Environmental and planning offences</li>
    <li> Extradition</li>
    <li>Fraud</li>
    <li> Juvenile justice</li>
    <li>Mental illness</li>
    <li> Proceeds of crime / money laundering</li>
    <li>Property offences</li>
    <li> Sexual assault</li>
    <li>Traffic</li>
    <li> White collar and corporate crime</li>
    <li>Work health and safety</li>
  </ul>
</li>

into this:

  <li class="expandable"> Criminal </li>

which leaves me with just the parent text I need to collect.
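A small runnable sketch of that transformation, assuming a shortened copy of the markup above: extract() removes the tag from the tree and returns it, so the removed <ul> can still be read for the child items afterwards. The variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup shown above.
html = """<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Fraud</li>
    <li>Traffic</li>
  </ul>
</li>"""

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", class_="expandable")

# extract() detaches the <ul> from the tree and returns the detached tag.
removed = li.find("ul", class_="subPracticeAreas").extract()

# With the nested list gone, only the parent's own text is left.
parent_text = li.get_text().strip()

# The detached <ul> still holds its <li> children.
child_texts = [item.get_text(strip=True) for item in removed.find_all("li")]

print(parent_text, child_texts)
```

Note that this modifies the parsed tree in place, so any later search on the same soup will no longer find the extracted <ul>.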

  • To accomplish the two tasks mentioned in my original post, I used the following code.

    import re

    # `aop` and `page_results` are defined elsewhere in my script
    aop_find = page_soup.find(string=re.compile('.*{0}.*'.format(aop)), recursive=True)
    if aop_find is not None:
        aop_list_headers = page_soup.findAll("li",{"class":"expandable"})
        # AOP prefix
        aop_prefix = "AOP "
        aop_result = {}
        # Getting the dictionary key
        for aop_li_head in aop_list_headers:
            # Extracting the child <ul>; extract() returns the removed tag,
            # so the sub practice groups data is still available
            aop_values = aop_li_head.find("ul",{"class":"subPracticeAreas"}).extract()
            # Key name (e.g. "Criminal ")
            aop_key_name = aop_li_head.getText().strip() + " "
            # Counter, restarted for each parent
            aop_counter = 1
            # Finding the text in each value (only the <li> tags, not the whitespace between them)
            for aop_value in aop_values.find_all("li"):
                aop_value = aop_value.getText().strip()
                aop_result[aop_prefix + aop_key_name + str(aop_counter)] = aop_value
                aop_counter = aop_counter + 1
        # Appending loop results
        page_results.append(aop_result)
    

    Thanks for all the input!

    Cheers
