How do I parse the link that goes with a given link text?



I have this content:

<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
</td>
</tr>
<tr>
<td width="80%">   </td>
<td align="right" bgcolor="#FFFF80" style="font-size: 9pt;">
<p><a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
</td>
</tr>
<tr>
<td colspan="2" style="font-size: 9pt;color: red;">
<p>This Recommendation includes an electronic attachment containing the ASN.1 definitions for the IN CS-3 SCF-SRF interface</p>
</td>
</tr>

I want to extract:

  • the text of the following href, and
  • the link that goes with the text Summary
<a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a>

My code:

import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')

You want to pull out the URL associated with the text Summary:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')
# :-soup-contains selects the first <a> whose text contains "Summary"
link = soup.select_one('a:-soup-contains("Summary")').get('href')
# the href is relative, so resolve it against the page URL instead of concatenating strings
print(urljoin(url, link))

Output:

https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt
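The two hrefs in the question are relative in different ways: one starts at the site root (`/dms_pubrec/...`) and one is relative to the current page (`./htmldoc.asp?...`). As a rough sketch of how each one could be turned into an absolute URL, the standard library's `urllib.parse.urljoin` can resolve them against the page URL. The helper name `resolve` is only illustrative and not part of the original answer.

```python
from urllib.parse import urljoin

# Page the hrefs were scraped from (taken from the question).
PAGE_URL = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"


def resolve(href, base=PAGE_URL):
    """Resolve a possibly relative href against the page URL (hypothetical helper)."""
    return urljoin(base, href)


# Root-relative href: replaces the whole path of the page URL.
toc_url = resolve("/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt")

# Page-relative href: replaces only the last path segment ("en").
summary_url = resolve("./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt")

print(toc_url)
print(summary_url)
```

Resolving against the full page URL handles both forms without any special-casing, which plain string concatenation cannot do.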

If you want both the contents and the href of every <a> tag, you can loop over them with find_all, as follows:

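# href=True skips <a> tags that have no href attribute
for a in soup.find_all('a', href=True):
    # print rather than return: this loop is not inside a function
    print(a.contents, a['href'])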
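To try the find_all approach without hitting the network, here is a minimal sketch that runs on a trimmed copy of the HTML from the question. The function name `extract_links` and the choice to return (text, absolute URL) pairs are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

# Trimmed copy of the HTML from the question, so no request is needed.
HTML = """
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
<p><a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
"""


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        # get_text flattens the nested <strong>/<font> tags into plain text
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(HTML, PAGE_URL)
for text, link_url in links:
    print(text, link_url)
```

Using `get_text(strip=True)` instead of `a.contents` gives the visible label as a plain string, which is usually easier to match against than a list of nested tags.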
