Python BeautifulSoup web scraping problem


import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.freejobalert.com/upsc-recruitment/16960/#Engg-Services2019")
c = page.content
soup = BeautifulSoup(c, "html.parser")
data = soup.find_all("tr")
dict = {}
for r in data:
    td = r.find_all("td", {"style": "text-align: center;"})
    for d in td:
        link = d.find_all("a")
        for li in link:
            span = li.find_all("span", {"style": "color: #008000;"})
            for s in span:
                strong = s.find_all("strong")
                for st in strong:
                    dict['title'] = st.text
                    for l in link:
                        dict["link"] = l['href']
                    print(dict)

It gives:

{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}
{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}
{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}

I expected:

{'title': 'Apply Online', 'link': 'https://upsconline.nic.in/mainmenu2.php'}
{'title': 'Notification', 'link': 'http://www.freejobalert.com/wp-content/uploads/2018/09/Notification-UPSC-Engg-Services-Prelims-Exam-2019.pdf'}
{'title': 'Official Website ', 'link': 'http://www.upsc.gov.in/'}

Here I want all the "Important Links" for each table, meaning "Apply Online", "Notification", and "Official Website" with their links. But instead it gives me "Syllabus" as the title, with the same link repeated.

Please take a look at this.

This may help you; please check the code below.

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.freejobalert.com/'
                    'upsc-recruitment/16960/#Engg-Services2019')
c = page.content
soup = BeautifulSoup(c, "html.parser")
rows = soup.find_all('tr')
result = {}
for i in rows:
    # title text lives in the green span, the URL in the row's anchors
    for title in i.find_all('span', attrs={'style': 'color: #008000;'}):
        result['Title'] = title.text
    for link in i.find_all('a', href=True):
        result['Link'] = link['href']
    print(result)
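The underlying issue in both snippets is that a single shared dict is overwritten on every iteration, so the last title/link pair wins and earlier ones are lost. A safer pattern is to build one dict per row so each title stays paired with the link from the same `<tr>`. The sketch below runs against a small made-up HTML fragment (the live page's real markup may differ, so treat the selectors as assumptions):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical fragment mimicking two "Important Links" rows on the page.
html = """
<table>
  <tr>
    <td style="text-align: center;">
      <span style="color: #008000;"><strong>Apply Online</strong></span>
    </td>
    <td style="text-align: center;">
      <a href="https://upsconline.nic.in/mainmenu2.php">Click Here</a>
    </td>
  </tr>
  <tr>
    <td style="text-align: center;">
      <span style="color: #008000;"><strong>Official Website</strong></span>
    </td>
    <td style="text-align: center;">
      <a href="http://www.upsc.gov.in/">Click Here</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for row in soup.find_all("tr"):
    span = row.find("span", attrs={"style": "color: #008000;"})
    link = row.find("a", href=True)
    if span and link:
        # One fresh dict per row, so titles and links stay paired.
        results.append({"title": span.get_text(strip=True),
                        "link": link["href"]})

for item in results:
    print(item)
```

Collecting into a list of per-row dicts also makes the output easy to filter afterwards, e.g. keeping only the rows whose title is "Apply Online", "Notification", or "Official Website".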
