Extracting data from a popup that appears on mouse hover, using Selenium and Python



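Hi everyone, this is my first question. I am trying to extract data from a website, but the data only appears when I hover the mouse over it. The site is http://insideairbnb.com/melbourne/. I want to extract the occupancy rate of each listing from the panel that pops up when the mouse pointer hovers over a point on the map. I am trying to adapt @frianH's code from the Stack Overflow post "Scrape website with dynamic mouseover event". I am new to extracting data with Selenium, though I do know the bs4 package. I have not managed to find the right XPath for the task. Thank you in advance. My code so far is: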

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Raw string so the backslashes in the Windows path are not read as escapes
browser = webdriver.Chrome(options=chrome_options, executable_path=r'C:\Users\Kunal\chromedriver.exe')
browser.get('http://insideairbnb.com/melbourne/')
browser.maximize_window()

# Wait for all circles
elements = WebDriverWait(browser, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="map"]/div[1]/div[2]/div[2]/svg')))
table = browser.find_element_by_class_name('leaflet-zoom-animated')

# Scroll the map into view
browser.execute_script("arguments[0].scrollIntoView(true);", table)

data = []
for circle in elements:
    # Move the pointer to each circle
    ActionChains(browser).move_to_element(circle).perform()
    # Wait for the mouseover effect
    mouseover = WebDriverWait(browser, 30).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="neighbourhoodBoundaries"]')))
    data.append(mouseover.text)
print(data[0])

Thanks in advance.

I checked a bunch of pages, and this one seems very resistant to Selenium's own methods, so we will have to rely on JavaScript. Here is the full code:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
browser = webdriver.Chrome(options=chrome_options, executable_path='chromedriver.exe')
browser.get('http://insideairbnb.com/melbourne/')
browser.maximize_window()

# Set up a 30-second WebDriver wait
explicit_wait30 = WebDriverWait(browser, 30)
try:
    # Wait for all circles to load
    circles = explicit_wait30.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'svg.leaflet-zoom-animated > g:nth-child(2) > circle')))
except TimeoutException:
    # The page did not load in time: refresh and wait once more,
    # so that `circles` is defined before the loop below
    browser.refresh()
    circles = explicit_wait30.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'svg.leaflet-zoom-animated > g:nth-child(2) > circle')))

data = []
for circle in circles:
    # Execute mouseover on the element
    browser.execute_script("const mouseoverEvent = new Event('mouseover');arguments[0].dispatchEvent(mouseoverEvent)", circle)
    # Wait for the data to appear
    listing = explicit_wait30.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#listingHover')))
    # listing now contains the full element list - you can parse this yourself and add the necessary data to `data`
    # .......
    # Close the listing
    browser.execute_script("arguments[0].click()", listing.find_element(By.TAG_NAME, 'button'))

I also use CSS selectors instead of XPath. Here is how the flow works:

circles = explicit_wait30.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'svg.leaflet-zoom-animated > g:nth-child(2) > circle')))

This waits until all the circles are present and stores them in circles.

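Keep in mind that the page loads the circles very slowly, so I set up a try/except block that automatically refreshes the page if the circles have not loaded within 30 seconds. Feel free to change this however you like.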
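If a single refresh is not enough, the same idea can be generalised into a small retry helper. This is only a sketch: `wait_with_refresh` and its parameter names are hypothetical and not part of Selenium; the caller passes in the wait, the refresh action, and the exception type to retry on.

```python
def wait_with_refresh(wait_for_circles, refresh_page, retry_on=Exception, attempts=3):
    """Call wait_for_circles(); on failure, refresh the page and try again.

    All names here are hypothetical. The last error is re-raised if every
    attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return wait_for_circles()
        except retry_on as error:
            last_error = error
            # Only refresh if another attempt is still coming
            if attempt < attempts - 1:
                refresh_page()
    raise last_error
```

With the code above, it could be called roughly as wait_with_refresh(lambda: explicit_wait30.until(...), browser.refresh, retry_on=TimeoutException), where the elided argument is the same circle locator used earlier.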

Now we have to iterate over all the circles:

for circle in circles:

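Next, we simulate a mouseover event on the circle, and we will use JavaScript to do that.

This is what the JavaScript looks like (note that circle refers to the element we will pass in from Selenium):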

const mouseoverEvent = new Event('mouseover');
circle.dispatchEvent(mouseoverEvent)

This is how the script is executed through Selenium:

browser.execute_script("const mouseoverEvent = new Event('mouseover');arguments[0].dispatchEvent(mouseoverEvent)", circle)
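If the plain Event does not trigger the panel on some other page, one variation worth trying is a bubbling MouseEvent wrapped in a small helper. This is a sketch under assumptions: `hover_via_js` and `HOVER_SCRIPT` are hypothetical names, and whether bubbling matters depends on how the page attaches its handlers, which has not been verified for this site.

```python
# JavaScript run in the page; arguments[0] is the element passed from Python.
HOVER_SCRIPT = (
    "const ev = new MouseEvent('mouseover', {bubbles: true, cancelable: true});"
    "arguments[0].dispatchEvent(ev);"
)


def hover_via_js(driver, element):
    """Dispatch a synthetic mouseover on `element` via driver.execute_script."""
    driver.execute_script(HOVER_SCRIPT, element)
```

Inside the loop, this would be called as hover_via_js(browser, circle).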

Now we have to wait for the listing to appear:

listing = explicit_wait30.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#listingHover')))

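You now have listing, an element that itself contains many other elements; you can easily extract each one and store it in data.

If you do not care about extracting each element separately, simply calling .text on listing gives something like this: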

'Tanya\n(No other listings)\n23127829\nSerene room for a single person or a couple.\nGreater Dandenong\nPrivate room\n$37 income/month (est.)\n$46 /night\n4 night minimum\n10 nights/year (est.)\n2.7% occupancy rate (est.)\n0.1 reviews/month\n1 reviews\nlast: 20/02/2018\nLOW availability\n0 days/year (0%)\nclick listing on map to "pin" details'

That's it. You can then append the result to data and you are done!
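Since the question asks specifically for the occupancy rate, here is a minimal sketch of pulling it out of the .text output with a regular expression. It assumes the panel text always contains a line of the form "2.7% occupancy rate (est.)", as in the sample above; `parse_occupancy` is a hypothetical helper name, and the sample string is abbreviated.

```python
import re


def parse_occupancy(listing_text):
    """Return the occupancy rate as a float percentage, or None if not found."""
    match = re.search(r"(\d+(?:\.\d+)?)%\s+occupancy rate", listing_text)
    return float(match.group(1)) if match else None


# Abbreviated sample in the shape of the panel text shown above
sample = "Tanya\n23127829\n10 nights/year (est.)\n2.7% occupancy rate (est.)\n0 days/year (0%)"
print(parse_occupancy(sample))  # prints 2.7
```

Inside the loop, data.append(parse_occupancy(listing.text)) would then collect one value per circle.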
