Scraping hrefs with BeautifulSoup only returns the first link



I am trying to scrape a table with BeautifulSoup. One of its columns can contain multiple links (hrefs) per cell, as in the example below.

<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> / 
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> / 
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>

I am using the code below to locate them, but it only returns the first href. For rows that have several hrefs, none of the others are returned.

from time import sleep
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)
from bs4 import BeautifulSoup
# Scrape the smallcaps website for IPO Information and save into dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"
service = Service("C:Developmentchromedriver_win32chromedriver.exe")
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service)
driver.get(smallcaps_URL)
sleep(3)
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")
ipo_headers = []
ipo_contents = []
for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))
for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True);
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)
# Prints complete scraped dataframe from SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)
# Next thing to do is scrape a few other websites for comparison and remove duplicates.

Current output

Company_name ASX_code Issue_price  Raise                     Focus                                        Information
0              Allup Silica (TBA)      APS       $0.20    $5m               Silica sand                           https://allupsilica.com/
1          Andean Mining (14 Feb)      ADM       $0.20    $6m       Mineral exploration  https://smallcaps.com.au/andean-mining-ipo-col...
2       Catalano Seafood (24 Feb)      CSF       $0.20    $6m                   Seafood                      https://www.catalanos.net.au/
3     Dragonfly Biosciences (TBA)      DRF       $0.20   $11m           Cannabidiol oil                  https://dragonflybiosciences.com/
4     Equity Story Group (18 Mar)      EQS       $0.20  $5.5m  Market advice & research                        https://equitystory.com.au/
5             Far East Gold (TBA)      FEG       $0.20   $12m       Mineral exploration  https://smallcaps.com.au/far-east-gold-asx-ipo...
6        Killi Resources (10 Feb)      KLI       $0.20    $6m           Gold and copper                          https://www.killi.com.au/
7           Lukin Resources (TBA)      LKN       $0.20  $7.5m       Mineral exploration  https://smallcaps.com.au/lukin-resources-launc...
8         Many Peaks Gold (2 Mar)      MPG       $0.20  $5.5m       Mineral exploration                          https://manypeaks.com.au/
9         Norfolk Metals (14 Mar)      NFL       $0.20  $5.5m          Gold and uranium                      https://norfolkmetals.com.au/
10    Omnia Metals Group (21 Feb)      OM1       $0.20  $5.5m       Mineral exploration                    https://www.omniametals.com.au/
11        Pure Resources (16 Mar)      PR1       $0.20  $4.6m       Mineral exploration                   http://www.pureresources.com.au/
12     Pinnacle Minerals (11 Mar)      PIM       $0.20  $5.5m        Kaolin - Haloysite                   https://pinnacleminerals.com.au/
13          Stelar Metals (7 Mar)      SLB       $0.20    $7m           Copper and zinc                       https://stelarmetals.com.au/
14        Top End Energy (21 Mar)      TEE       $0.20  $6.4m               Oil and gas                    http://www.topendenergy.com.au/
15  US Student Housing REIT (TBA)      USQ       $1.38   $45m  US student accommodation                              https://usq-reit.com/
Process finished with exit code 0
The expected output should have three links/hrefs for some rows in the 'Information' column, but only the first link/href is returned for every row. Could someone please point me in the right direction?
a = content.find('a', href=True);

If there can be more than one link, this should probably be a find_all instead, so:

a = content.find_all('a', href=True);
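
To make the difference concrete, here is a minimal sketch that runs both calls on the sample cell from the question. It parses only that one `<td>` rather than the live page, and the variable names (`cell`, `first_link`, `all_links`) are just illustrative.

```python
from bs4 import BeautifulSoup

# Sample cell copied from the question; nothing is fetched from the live site.
html = """
<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
"""

cell = BeautifulSoup(html, "html.parser").find("td")

# find() stops at the first matching tag and returns a single Tag (or None).
first_link = cell.find("a", href=True)

# find_all() returns every matching tag in a list-like ResultSet.
all_links = cell.find_all("a", href=True)

first_href = first_link["href"]
all_hrefs = [link["href"] for link in all_links]

print(first_href)
print(all_hrefs)
```

Note that `find_all` returns a collection, so `a['href']` no longer works on the result directly; each element has to be indexed individually, as in the list comprehension above.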

The following seems to work: it collects the href of every a tag in the cell, so multiple hrefs are captured where they are available.

for content in all_ipo_content:
    if content.a:
        all_urls = [link.get("href") for link in content.find_all('a')]
        ipo_contents.append(all_urls)
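
One possible way to fit this into the original loop is sketched below. It keeps the `else` branch for cells without links and joins the hrefs of a cell into a single string, so every cell still contributes exactly one value. The small inline table is a hypothetical stand-in for the real page, and joining with " / " is an assumption about the desired output, not something the question specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical two-column table standing in for the scraped page source.
html = """
<table>
<tr><th>Company name</th><th>Information</th></tr>
<tr><td>Andean Mining (14 Feb)</td>
<td><a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg">Video</a> /
<a href="https://andeanmining.com.au/">Website</a></td></tr>
<tr><td>Allup Silica (TBA)</td>
<td><a href="https://allupsilica.com/">Website</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

ipo_headers = [th.text.replace(" ", "_") for th in soup.find_all("th")]

ipo_contents = []
for content in soup.find_all("td"):
    links = content.find_all("a", href=True)
    if links:
        # One string per cell, with every href separated by " / ".
        ipo_contents.append(" / ".join(link["href"] for link in links))
    else:
        # Cells without links keep their text, as in the question's code.
        ipo_contents.append(content.text.strip())

# Group the flat list of cells into rows of len(ipo_headers) values each.
n_cols = len(ipo_headers)
rows = [ipo_contents[i:i + n_cols] for i in range(0, len(ipo_contents), n_cols)]

for row in rows:
    print(row)
```

Because each cell is still a single string, the flat list keeps a regular shape, so `rows` (or `ipo_contents` with the question's `np.reshape` call) can be passed to `pd.DataFrame` with `columns=ipo_headers` as before.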
