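I want to get the image link from the description tag of an RSS feed.
Using feedparser I can get the value of the description tag, but I want the image link inside that tag.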
<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>
Then I tried extracting it as a substring in Python.
import re
text = "<![CDATA[<img src='https://adaderanaenglish.s3.amazonaws.com/' width='60' align='left' hspace='5'/>Former Tamil National Alliance (TNA) MP P. Piyasena had been sentenced to 4 years in prison over a case of misusing a state vehicle after losing his MP post. MORE..]]>"
match = re.search('<img src="(.+?) "', text, flags=re.IGNORECASE)
try:
    result = match.group(1)
except AttributeError:
    # re.search returns None when the pattern is not found
    result = "no match found"
print(result)
c:/users/asus/desktop/untitled/a.py
no match found
Process finished with exit code 0
You can get the image link without a regular expression by parsing the markup with BeautifulSoup.
from bs4 import BeautifulSoup
data='''<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>'''
soup=BeautifulSoup(data,'html.parser')
item=soup.find('description')
data1=item.next_element
soup1=BeautifulSoup(data1,'html.parser')
print(soup1.find('img')['src'])
Output:
https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg
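If installing BeautifulSoup is not an option, the same parse-then-look-up idea can be sketched with the standard library's html.parser. This is only a minimal sketch: the names ImgSrcCollector and first_img_src are made up for illustration, and it assumes you already have the description's HTML as a plain string with the CDATA wrapper removed (for example, the description value you got from feedparser).

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_img_src(html_fragment):
    """Return the first <img> src in html_fragment, or None if there is none."""
    collector = ImgSrcCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.sources[0] if collector.sources else None


# Hypothetical input: the description HTML without the CDATA wrapper.
description_html = (
    '<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/'
    '25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" />'
    '</div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian</p></div>'
)
print(first_img_src(description_html))
```

Because the parser reads attributes rather than raw characters, it does not matter whether the feed quotes src with single or double quotes.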
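You need to change your regex slightly for it to work. Your text quotes the attribute with single quotes, so what you want is to capture the content immediately after src=' and stop as soon as the next ' character is reached (a lazy match). So your regex should be: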
match = re.search(r"src='(.*?)'", text)
An online regex tester can help you build and debug patterns like this one.
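Since the two feeds in the question quote src differently (single quotes in one, double quotes in the other), a pattern that accepts either may be more convenient. The following is a rough sketch rather than a robust HTML parser; IMG_SRC_RE and find_img_src are names I made up for this example, and the sample strings are shortened versions of the ones in the question.

```python
import re

# Match the src attribute inside an <img> tag, accepting either quote style.
# Group 1 captures the opening quote; \1 requires the same quote to close the value.
IMG_SRC_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def find_img_src(text):
    """Return the first image URL found in text, or None when nothing matches."""
    match = IMG_SRC_RE.search(text)
    return match.group(2) if match else None


single_quoted = (
    "<![CDATA[<img src='https://adaderanaenglish.s3.amazonaws.com/' "
    "width='60' align='left' hspace='5'/>MORE..]]>"
)
double_quoted = (
    '<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/'
    '25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="" /></div>'
)

print(find_img_src(single_quoted))
print(find_img_src(double_quoted))
print(find_img_src("<p>no image here</p>"))
```

Checking the match object before calling group() also avoids the try/except used in the question.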
You can also use split. As you mentioned in your question, this relies on you having already isolated the correct tag, so you are working with text.
text = '''
<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>
'''
link = text.split('src="')[1].split('"')[0]
print(link)
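One caveat with chained split calls: if the text contains no src=" at all, indexing [1] raises IndexError. A small variation using str.partition avoids that. This is just a sketch; src_between_quotes is a hypothetical helper name, and it assumes the attribute is double-quoted unless you pass a different marker and quote.

```python
def src_between_quotes(text, marker='src="', quote='"'):
    """Return the text between marker and the next quote, or None if absent."""
    # partition never raises: found is '' when the marker is missing.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    value, closing, _ = rest.partition(quote)
    # Without a closing quote the value is unterminated, so treat it as no match.
    return value if closing else None


sample = '<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="" /></div>'
print(src_between_quotes(sample))
print(src_between_quotes("<p>no image here</p>"))
```

For the single-quoted feed you could call src_between_quotes(text, marker="src='", quote="'") instead.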