I am trying to scrape a table with BeautifulSoup. In one of the columns a cell can contain multiple links/hrefs, as in the example below.
<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
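To make the difference concrete, here is a minimal sketch run against the sample cell above (attributes trimmed for brevity): `find()` stops at the first `<a>`, while `find_all()` returns all three.

```python
from bs4 import BeautifulSoup

# The sample cell from the question, reduced to the parts that matter.
html = '''<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg">Video</a> /
<a href="https://andeanmining.com.au/">Website</a></td>'''

cell = BeautifulSoup(html, 'html.parser').td

# find() returns only the first matching <a> tag.
first = cell.find('a', href=True)['href']

# find_all() returns every matching <a>, so all three hrefs are available.
all_hrefs = [a['href'] for a in cell.find_all('a', href=True)]

print(first)
print(all_hrefs)
```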
I use the code below to locate them, but it returns only the first href; for rows with multiple hrefs, none of the others are returned.
from time import sleep
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)
from bs4 import BeautifulSoup
# Scrape the smallcaps website for IPO Information and save into dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"
service = Service(r"C:\Development\chromedriver_win32\chromedriver.exe")
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service)
driver.get(smallcaps_URL)
sleep(3)
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")
ipo_headers = []
ipo_contents = []
for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))
for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True)
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)
# Prints complete scraped dataframe from SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)
# Next thing to do is scrape a few other websites for comparison and remove duplicates.
Current output
Company_name ASX_code Issue_price Raise Focus Information
0 Allup Silica (TBA) APS $0.20 $5m Silica sand https://allupsilica.com/
1 Andean Mining (14 Feb) ADM $0.20 $6m Mineral exploration https://smallcaps.com.au/andean-mining-ipo-col...
2 Catalano Seafood (24 Feb) CSF $0.20 $6m Seafood https://www.catalanos.net.au/
3 Dragonfly Biosciences (TBA) DRF $0.20 $11m Cannabidiol oil https://dragonflybiosciences.com/
4 Equity Story Group (18 Mar) EQS $0.20 $5.5m Market advice & research https://equitystory.com.au/
5 Far East Gold (TBA) FEG $0.20 $12m Mineral exploration https://smallcaps.com.au/far-east-gold-asx-ipo...
6 Killi Resources (10 Feb) KLI $0.20 $6m Gold and copper https://www.killi.com.au/
7 Lukin Resources (TBA) LKN $0.20 $7.5m Mineral exploration https://smallcaps.com.au/lukin-resources-launc...
8 Many Peaks Gold (2 Mar) MPG $0.20 $5.5m Mineral exploration https://manypeaks.com.au/
9 Norfolk Metals (14 Mar) NFL $0.20 $5.5m Gold and uranium https://norfolkmetals.com.au/
10 Omnia Metals Group (21 Feb) OM1 $0.20 $5.5m Mineral exploration https://www.omniametals.com.au/
11 Pure Resources (16 Mar) PR1 $0.20 $4.6m Mineral exploration http://www.pureresources.com.au/
12 Pinnacle Minerals (11 Mar) PIM $0.20 $5.5m Kaolin - Haloysite https://pinnacleminerals.com.au/
13 Stelar Metals (7 Mar) SLB $0.20 $7m Copper and zinc https://stelarmetals.com.au/
14 Top End Energy (21 Mar) TEE $0.20 $6.4m Oil and gas http://www.topendenergy.com.au/
15 US Student Housing REIT (TBA) USQ $1.38 $45m US student accommodation https://usq-reit.com/
Process finished with exit code 0
The expected output should have three links/hrefs for some rows in the 'Information' column; however, only the first link/href is returned for all of them. Could someone please point me in the right direction?
a = content.find('a', href=True)
If there can be more than one match, this should be a find_all instead:
a = content.find_all('a', href=True)
The following seems to work: it collects every href inside the cell, so multiple hrefs are kept when available.
for content in all_ipo_content:
    if content.a:
        all_urls = [a.get("href") for a in content.find_all('a')]
        ipo_contents.append(all_urls)
    else:
        ipo_contents.append(content.text)
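One caveat: ipo_contents now mixes plain strings with lists, which can trip up the later np.reshape/DataFrame step, since recent NumPy versions reject ragged inputs. A minimal sketch of one way around that, joining a cell's hrefs into a single delimiter-separated string so every cell stays one flat entry (the table HTML and example.com URLs here are stand-ins, not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for the scraped page: one row with a plain cell and a multi-link cell.
html = '''<table><tr>
<td>ADM</td>
<td><a href="https://example.com/article">Article</a> /
<a href="https://example.com/video">Video</a></td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

ipo_contents = []
for content in soup.find_all('td'):
    if content.a:
        # Join every href in the cell into one string, so each <td>
        # still contributes exactly one flat list entry for reshape.
        urls = ' / '.join(a['href'] for a in content.find_all('a', href=True))
        ipo_contents.append(urls)
    else:
        ipo_contents.append(content.text)

print(ipo_contents)
```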