Parsing complex <li> tags with Beautiful Soup



I have a web page that contains the following markup:

<li> 
<a href="/wiki/Thalassery" title="Thalassery">Thalassery</a> (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from 
<i>Tellicherry</i></li>
<li><a href="/wiki/Thanjavur" title="Thanjavur">Thanjavur</a> (<a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li><a href="/wiki/Thane" title="Thane">Thane</a> (<a href="/wiki/Marathi_language" title="Marathi language">Marathi</a>: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li><a href="/wiki/Thoothukudi" title="Thoothukudi">Thoothukudi</a> (<a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>

I need to parse this so that the result contains the following words: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin

Can anyone help?

You can use .findAll() to get all the li elements, then call find() on each one to get its first 'a' and 'i' tags:

# soup is a BeautifulSoup object built from the page's HTML
for item in soup.findAll('li'):
    print(item.find('a').text, item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
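
The loop above assumes every li contains both an a and an i tag; if one is missing, find() returns None and .text raises an AttributeError. Below is a minimal self-contained sketch of the same approach with that case guarded, flattened into the single word list the question asks for. The function name extract_names and the third "Example" list item are hypothetical and added only to illustrate the missing-tag case; the markup is a shortened copy of the sample from the question. It uses find_all, the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
# The last <li> is a made-up item with no <i> tag, added for illustration.
html = '''
<li>
<a href="/wiki/Thalassery" title="Thalassery">Thalassery</a> (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>), from
<i>Tellicherry</i></li>
<li><a href="/wiki/Thoothukudi" title="Thoothukudi">Thoothukudi</a> (<a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
<li><a href="/wiki/Example" title="Example">Example</a> with no former name</li>
'''


def extract_names(markup):
    """Return a (current name, former name or None) pair for each <li>."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for item in soup.find_all('li'):
        link = item.find('a')  # first link in the item: the current name
        old = item.find('i')   # first <i> in the item: the former name, if any
        if link is None:
            continue           # skip list items that have no link at all
        pairs.append((link.get_text(strip=True),
                      old.get_text(strip=True) if old else None))
    return pairs


pairs = extract_names(html)
# Flatten the pairs into one list, dropping missing former names.
words = [name for pair in pairs for name in pair if name]
print(words)
```

Because find() returns only the first match, the short form Tuty in the last real item is ignored, which matches the word list requested in the question.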

You can also try a solution using simplified_scrapy, which is fault-tolerant:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li> 
<a href="/wiki/Thalassery" title="Thalassery">Thalassery</a> (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from 
<i>Tellicherry</i></li>
<li><a href="/wiki/Thanjavur" title="Thanjavur">Thanjavur</a> (<a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li><a href="/wiki/Thane" title="Thane">Thane</a> (<a href="/wiki/Marathi_language" title="Marathi language">Marathi</a>: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li><a href="/wiki/Thoothukudi" title="Thoothukudi">Thoothukudi</a> (<a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
# Pair each item's first <a> text with its first <i> text ('' if there is no <i>)
print([(li.a.text, li.i.text if li.i else '') for li in lis])

Result:

[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]
