I have this content:
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
</td>
</tr>
<tr>
<td width="80%"> </td>
<td align="right" bgcolor="#FFFF80" style="font-size: 9pt;">
<p><a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
</td>
</tr>
<tr>
<td colspan="2" style="font-size: 9pt;color: red;">
<p>This Recommendation includes an electronic attachment containing the ASN.1 definitions for the IN CS-3 SCF-SRF interface</p>
</td>
</tr>
I want to extract:
- the text of the following link, and
- its href, for the link whose text is Summary:
<a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a>
My code:
import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')
You want to pull out the link whose text is Summary:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')
# Select the first <a> whose text contains "Summary" and read its href
link = soup.select_one('a:-soup-contains("Summary")').get('href')
# The href is relative, so resolve it against the page URL rather than concatenating strings
print(urljoin(url, link))
Output:
https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt
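If you want to check the selector without hitting the network, here is a minimal sketch that runs against a trimmed copy of the markup from the question. The names `SAMPLE_HTML`, `PAGE_URL` and `find_link` are made up for illustration, and the sketch assumes the live page returns the same markup as the snippet. It also assumes a Soup Sieve version recent enough to support `:-soup-contains`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to match the live page).
SAMPLE_HTML = """
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
<p><a href="./htmldoc.asp?doc=trecqT-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
"""
PAGE_URL = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"


def find_link(html, label, base_url):
    """Return (text, absolute_url) for the first <a> containing label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    a = soup.select_one(f'a:-soup-contains("{label}")')
    # select_one returns None when nothing matches, so guard before indexing
    if a is None or not a.has_attr("href"):
        return None
    # urljoin resolves relative hrefs against the page URL, as a browser would
    return a.get_text(strip=True), urljoin(base_url, a["href"])


print(find_link(SAMPLE_HTML, "Summary", PAGE_URL))
```

The `None` guard matters on the real page: if the label is missing, calling `.get('href')` directly on the result of `select_one` raises an `AttributeError`.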
If you want to get both the contents and the href of every <a> tag, you can loop over them with find_all, like this:
for a in soup.find_all('a', href=True):
    # a.contents is the list of child nodes; a['href'] is the raw attribute value
    print(a.contents, a['href'])
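If you would rather collect the results than print them, the loop can be wrapped in a function. This is only a sketch: `list_links`, `DEMO_HTML` and `DEMO_BASE` are hypothetical names, and the sample markup is invented to show that anchors without an href are skipped. It uses `get_text(strip=True)` instead of `a.contents` so you get the visible text rather than the nested `<strong>`/`<font>` tags.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented sample markup; the second anchor has no href and should be skipped.
DEMO_HTML = """
<p><a href="./htmldoc.asp?doc=example.txt"><strong><font size="1">Summary </font></strong></a></p>
<p><a name="top">No link here</a></p>
<p><a href="/files/toc.txt"><strong>Table of Contents </strong></a></p>
"""
DEMO_BASE = "https://www.example.org/rec/some-doc/en"


def list_links(html, base_url):
    """Return a list of (text, absolute_url) for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)  # href=True filters out anchors without one
    ]


for text, href in list_links(DEMO_HTML, DEMO_BASE):
    print(text, href)
```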