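Scraping a website that always has the same URL using Selenium

I am currently scraping a website, but the problem is that the site always has the same URL, which keeps me from scraping it properly. I am still fairly new to Selenium and am trying to figure out how to scrape this site. The website is "https://fcraonline.nic.in/fc3_amount.aspx". I would like to scrape every district of every state for every year. This is the code I have written so far: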

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path = "./chromedriver.exe")
driver.get("https://fcraonline.nic.in/fc3_amount.aspx")
# find_elements_by_xpath returns an array of selenium objects.
titles_element = driver.find_elements_by_xpath("adiv[@class='col-md-12']")
# use list comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]
# print out all the titles.
print('titles:')
print(titles, '\n')

If anyone could guide me or teach me how to solve this, that would be great. Thank you all for your time.

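There are at least 3 problems here:

adiv[@class='col-md-12'] is not a valid XPath expression
No element on that page is located by //div[@class='col-md-12'] either
After opening the web page, you have to add some kind of wait/delay to let the elements load before you can access them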

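First, I would use CSS selectors, because they are much more versatile than XPath. Second, if you want the text of an element, you can simply do driver.find_element_by_css_selector('thecssselector').text. This scrapes only the text inside the element, which is what you are trying to do. Hope this helps. Third, I am not sure which element you are trying to scrape, because for me no data is displayed apart from some select boxes, the header, and the top menu. Make sure you do not need to navigate to the right page first, or make sure the element you want has loaded by using Python's built-in time module with time.sleep(aNumberInSeconds).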

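Edit: I would suggest using Selenium's explicit/implicit wait functions to wait for the page to load. I find plain Python time.sleep easier to use when testing a single point of failure, but for finished code Selenium's own waits are more reliable. Check out https://www.browserstack.com/guide/selenium-wait-for-page-to-load for more information.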
