Can't scrape the href links from a table



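I'm trying to pull the href links out of a table; later I need to visit each link one by one to get at the data behind it. But I can't figure out how to do it. I've tried find_all, and I keep getting the error "ResultSet object has no attribute '%s'".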

HTML (it's really long, so this is about 1/10 of it):

<thead>
<tr class="sctablehead">
<th>Academic Program</th>
<th>Departments</th>
<th>Academic Level</th>
<th>College</th>
<th>Online</th>
<th>Degree Type</th>
</tr>
</thead>
<tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td><td>Graduate</td><td>BUS</td><td></td><td>MAC</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/">Accounting</a></td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/">Accounting</a></td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/">Actuarial Science</a></td><td>Mathematics, Economics, Finance</td><td>Undergraduate</td><td>STEM</td><td></td><td>Minor</td></tr>
<tr class="even"><td><a href="/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/">Adult Gerontology Acute Care Nurse Practitioner</a></td><td>Nursing</td><td>Graduate</td><td>HHS</td><td></td><td>PMC</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/">Advertising and Public Relations</a></td><td>Advertising</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/">Advertising Public Relations</a></td><td>Marketing</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/">Aerospace Studies</a></td><td>Aerospace Studies</td><td>Undergraduate</td><td>HHS</td><td></td><td>Minor</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-liberal-arts-social-sciences-education/department-africana-studies-minor/">Africana Studies</a></td><td>Africana Studies</td><td>Undergraduate</td><td>BCLASSE</td><td></td><td>Minor</td></tr>

… and so on

My code:

r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find_all('table', {"class":"sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr class'):
    for a in tr.find_all('a'):
        print(a['href'])

If your table is being found correctly (you didn't provide the full HTML), then you only need this:

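import requests
from bs4 import BeautifulSoup as bs

r = requests.get(driver.current_url)  # driver: your existing Selenium WebDriver
soup = bs(r.content, 'html.parser')
# find() returns a single Tag; find_all() would return a ResultSet, which has no find_all()
programs_table = soup.find('table', {"class": "sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr'):
    for a in tr.find_all('a'):
        print(a['href'])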

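In other words, try programs_table.find_all("tr") instead of programs_table.find_all("tr class"): the first argument is a tag name, and "tr class" is not one. Note also that programs_table has to be a single tag, so fetch it with soup.find(...) rather than soup.find_all(...); the ResultSet that find_all returns has no find_all method of its own, which is where the error in the question comes from.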
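
To make the difference concrete, here is a minimal sketch on a shortened copy of the table. The `html` string and its wrapping `<table>` tag are assumptions made for illustration, since the question only shows the `<thead>`/`<tbody>` part of the page.

```python
from bs4 import BeautifulSoup

# Shortened sample of the table from the question; the <table> wrapper is assumed.
html = """
<table class="sc_sctable tbl_degrees sorttable">
<tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/">Aerospace Studies</a></td><td>Aerospace Studies</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of tags); calling find_all() on it fails.
tables = soup.find_all("table", class_="sc_sctable")
try:
    tables.find_all("tr")
    failed = False
except AttributeError:
    failed = True

# find() returns a single Tag, which can be searched further.
table = soup.find("table", class_="sc_sctable")
hrefs = [a["href"] for tr in table.find_all("tr") for a in tr.find_all("a")]

print(failed)
print(hrefs)
```

If you do want to keep find_all for the table, index into the result first (for example `tables[0].find_all("tr")`).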

After making that change, I got the following output:

/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/
/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/
/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/
/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/

First of all, you shouldn't use find_all to grab a single tag unless you really want it wrapped in a list. So to get the table, you only need:

programs_table = soup.find('table', class_="sc_sctable")

Now, to get the href of each inner <a> tag, you can grab the <td> tags and keep the ones that contain an <a>:

tags_with_href = programs_table.tbody.find_all('td')
links = [each_tag.a['href'] for each_tag in tags_with_href if each_tag.a]
# -> ['/graduate/graduate-programs/master-accountancy/', ... ]
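
As an alternative (not what the answer above uses), the same links can probably be collected in one step with a CSS selector via `select`. This is only a sketch: the `html` string is a shortened, assumed sample of the page, with one extra row that has no link to show that it is skipped.

```python
from bs4 import BeautifulSoup

# Shortened, assumed sample; the last row has no <a> tag on purpose.
html = """
<table class="sc_sctable tbl_degrees sorttable">
<thead><tr class="sctablehead"><th>Academic Program</th><th>Departments</th></tr></thead>
<tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td></tr>
<tr class="odd"><td>No link here</td><td>Accounting</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match every <a> with an href inside a <td> of the table body.
links = [a["href"] for a in soup.select("table.sc_sctable tbody td a[href]")]
print(links)
```

The selector `table.sc_sctable` matches on a single class, so it does not depend on the order of the other classes in the attribute.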

If you want absolute URLs instead of relative ones, you can define base_url and prepend it to each relative URL:

base_url = '<base_url_of_website>'
links = [base_url + link for link in links]
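
Plain concatenation works as long as base_url has no trailing slash. A slightly more forgiving sketch uses `urljoin` from the standard library; the host name below is a hypothetical placeholder, not the real site.

```python
from urllib.parse import urljoin

# Hypothetical base URL; replace with the real site root.
base_url = "https://catalog.example.edu/"

links = [
    "/graduate/graduate-programs/master-accountancy/",
    "/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/",
]

# urljoin resolves each root-relative path against the base,
# so a trailing slash on base_url does not produce a double slash.
absolute_links = [urljoin(base_url, link) for link in links]

for url in absolute_links:
    print(url)
```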
