Why isn't my web scraper scraping the relevant information?



I built a web scraper in Python using Selenium. It runs without errors and opens the requested URL (though only one page, not all of them). But after the code runs there is no output, and the CSV I create with pandas is empty.

Looking at my code, can you see why it isn't scraping the items?

for i in range(0, 10):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives?page=' + str(i)
    driver.get(url)
    time.sleep(random.randint(1, 11))
    driver.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiative_list = []
    title = video.find_element(By.XPATH, "./html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[2]/article/a/div[2]").text
    topic = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[3]/div[2]").text
    period = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[5]/div/div[2]").text
    initiative_item = {
        'title': [title],
        'topic': [topic],
        'period': [period]
    }
    initiative_list.extend(initiative_item)

df = pd.DataFrame(initiative_list)
print(df)
df.to_csv('file_name.csv')

I have checked the XPaths and they seem to be correct, since they don't raise any errors.

Can you confirm that your variables title, topic and period are not empty?
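One quick way to confirm this is a small hypothetical helper (not part of your code) that you could call right after the three find_element(...).text calls inside your loop:

def check_values(title, topic, period):
    # repr() makes empty strings ('') easy to spot in the console
    print(repr(title), repr(topic), repr(period))

If any of them prints '', the XPath matched an element that has no visible text.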

If they are not empty, is your initiative_list perhaps being re-initialized somewhere inside the loop with initiative_list = []? That would wipe out everything already added to the list.
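For illustration, the accumulation pattern would look roughly like this. This is only a sketch: the values below are placeholders standing in for the scraped text, and it keeps your column names. Note also that append adds the whole dict as one row, whereas extend on a dict would only add its keys.

import pandas as pd

# Initialize the list once, before the loop
initiative_list = []
for i in range(0, 10):
    # Placeholder values standing in for the text scraped on page i
    title, topic, period = f"title {i}", f"topic {i}", f"period {i}"
    # append adds the whole dict as one row; extend(dict) would only add its keys
    initiative_list.append({'title': title, 'topic': topic, 'period': period})

# Build the DataFrame once, after the loop, and write it out
df = pd.DataFrame(initiative_list)
df.to_csv('file_name.csv', index=False)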

This should work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_en'
driver.get(url)

# Save the list of articles once they are present on the page
articles = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//article")))

# Loop once per article (XPath positions are 1-based, so go up to len(articles) inclusive)
for i in range(1, len(articles) + 1):
    # Save title, topic and period
    title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[2]"))).text
    print(title)
    topic = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[3]/div[2]"))).text
    print(topic)
    period = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[5]/div/div[2]"))).text
    print(period)

Once you have that information, you can do whatever you want with it.
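For instance, if you still want the CSV from your question, you could collect the values into a list inside the same loop and write it out with pandas afterwards. This is a sketch that continues the code above (it reuses driver, articles, WebDriverWait, EC and By) and assumes the same three column names:

import pandas as pd

rows = []
for i in range(1, len(articles) + 1):
    title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[2]"))).text
    topic = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[3]/div[2]"))).text
    period = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[5]/div/div[2]"))).text
    # One dict per article, appended once per iteration
    rows.append({'title': title, 'topic': topic, 'period': period})

# One DataFrame built after the loop, then written to disk
df = pd.DataFrame(rows)
df.to_csv('file_name.csv', index=False)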

I hope it helps.
