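I am trying to scrape the table on this page, "https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1", then click a horse's name, which leads to a new page, and scrape the table there.
This is the code I have so far. It is only test code for the first horse. (Some of the imports are for later use.)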
import pandas as pd
import xlsxwriter
from bs4 import BeautifulSoup
from playwright.sync_api import Playwright, sync_playwright, expect
import xlwings as xw


def scrape_ranking(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.click('text="AI ONE"')  # the link that will lead us to the horse info
        html = page.content()
        browser.close()
    tables = pd.read_html(html)
    df = tables[0]
    df.to_excel("hkjc.xlsx", index=False)


url_1 = 'https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1'
scrape_ranking(url_1)
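This code does not crash. However, instead of outputting the horse's record table, it outputs the original table from the race card page, "https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1".
Is there a way to make the code click the horse's name (the link), follow it to the new page (the horse's record), and output the table from there?
The site opens the horse's details in a popup window, so `page.content()` still returns the race card page. You can use the code from the Playwright documentation for handling popups and waiting for the page to load: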
# ...
page.goto(url)
with page.expect_popup() as popup_info:
    page.click('text="AI ONE"')
popup = popup_info.value
popup.wait_for_load_state("domcontentloaded")
html = popup.content()
# ...
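
For completeness, here is a minimal sketch of how the snippet above might fit into the function from the question. The names `horse_link_selector` and `scrape_horse_record`, and the `out_path` parameter, are made up for illustration. It also assumes that the first table on the popup page is the one you want, which I have not verified against the site; you may need a different index into `tables`.

```python
def horse_link_selector(horse_name: str) -> str:
    """Build the exact-match Playwright text selector used in the question."""
    return f'text="{horse_name}"'


def scrape_horse_record(url: str, horse_name: str, out_path: str = "hkjc.xlsx") -> None:
    """Open the race card, click the horse's link, and save a table from the popup."""
    # Imported inside the function so that the selector helper above can be
    # used even where Playwright or pandas is not installed.
    from io import StringIO

    import pandas as pd
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # The click opens a new window, so capture that window instead of
        # reading the content of the original page.
        with page.expect_popup() as popup_info:
            page.click(horse_link_selector(horse_name))
        popup = popup_info.value
        popup.wait_for_load_state("domcontentloaded")
        html = popup.content()
        browser.close()

    # Assumption: the first table in the popup is the wanted one.
    tables = pd.read_html(StringIO(html))
    tables[0].to_excel(out_path, index=False)
```

Calling `scrape_horse_record(url_1, "AI ONE")` with the race card URL from the question should then write the popup's table to `hkjc.xlsx` rather than the race card table. Passing the horse's name as a parameter also makes it easier to loop over the other horses later.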