Scraping data from Airbnb using Selenium



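Hi everyone, I am trying to scrape some data from Airbnb in order to build a small data analysis project for my portfolio. I tried several BeautifulSoup tutorials, but none of them work any more, even when I use exactly the same link as the tutorial.

Because of this I moved on to Selenium. I managed to open the site, and as a first step I am trying to extract the names of the listings. After that I want to extract all the other information (price, reviews, rating, amenities, etc.).

My code is below, but I get an empty list. Can anybody help me? How can I get the names of the apartments?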

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

website = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(website)
titles = driver.find_elements("class name", "n1v28t5c s1cjsi4j dir dir-ltr")

Thanks.

Selenium works fine together with bs4 and returns the correct data. Just run the code.

Example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
driver.maximize_window()
time.sleep(5)  # give the JavaScript-rendered listings time to load
soup = BeautifulSoup(driver.page_source, 'lxml')
# One iteration per listing card
for card in soup.select('div[class="c4mnd7m dir dir-ltr"]'):
    title = card.select_one('div[class="t1jojoys dir dir-ltr"]').text
    price = card.select_one('span[class="a8jt5op dir dir-ltr"]').text
    link = 'https://www.airbnb.com' + card.select_one('a[class="ln2bl2p dir dir-ltr"]').get('href')
    print(title, price)

Output:

Condo in Thessaloniki $50 per night
Apartment in Thessaloniki $38 per night
Condo in Thessaloniki $80 per night
Apartment in Thessaloniki $66 per night
Condo in Thessaloniki $23 per night
Apartment in Thessaloniki $74 per night
Condo in Thessaloniki $37 per night
Apartment in Thessaloniki $45 per night
Apartment in Thessaloniki $39 per night
Condo in Thessaloniki $27 per night
Apartment in Thessaloniki $28 per night
Condo in Thessaloniki $43 per night
Apartment in Thessaloniki $94 per night
Apartment in Thessaloniki $24 per night
Condo in Thessaloniki $86 per night
Loft in Thessaloniki $23 per night
Apartment in Thessaloníki $45 per night
Apartment in Thessaloniki $44 per night
Condo in Thessaloniki $50 per night
Condo in Thessaloniki $51 per night
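
Since the goal is a data analysis project, the price strings will probably need to become numbers at some point. The following is only a minimal sketch of that step, assuming the price text always looks like the output above (a dollar sign followed by digits). The helper name price_to_number and the sample tuples are made up for illustration, they are not part of the scraper.

```python
import re

# Assumption: prices look like "$50 per night" or "$1,200 per night".
PRICE_PATTERN = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def price_to_number(price_text):
    """Return the numeric part of a price string, or None if there is none."""
    match = PRICE_PATTERN.search(price_text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Hypothetical (title, price) pairs in the shape the loop above prints.
scraped = [
    ("Condo in Thessaloniki", "$50 per night"),
    ("Apartment in Thessaloniki", "$38 per night"),
    ("Loft in Thessaloniki", "$23 per night"),
]

rows = [{"title": title, "price_usd": price_to_number(price)}
        for title, price in scraped]
print(rows)
```

A list of dicts like rows can be passed directly to pd.DataFrame(rows) if you want to continue in pandas.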

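To extract the names of the properties, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR: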

    driver.get('https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id^='title']")))])
    
  • Using XPATH:

    driver.get('https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[starts-with(@id, 'title') and text()]")))])
    
  • Console output:

    ['Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Loft in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Thessaloniki', 'Apartment in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Flat in Thessaloniki', 'Apartment in Agios Pavlos']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
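
Both locators above match on an id prefix rather than on a class name. To make clear what "id starts with title" selects, here is a small standard-library illustration. It is not Selenium and does not touch Airbnb: the HTML fragment, the ids in it, and the PrefixIdCollector class are all made up for the example.

```python
from html.parser import HTMLParser


class PrefixIdCollector(HTMLParser):
    """Collect the text of <div> elements whose id starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0      # > 0 while inside a matching <div>
        self.buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a match
        elif (dict(attrs).get("id") or "").startswith(self.prefix):
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical markup; real pages will differ.
SAMPLE_HTML = """
<div id="title_101">Flat in Thessaloniki</div>
<div id="title_102">Loft in Thessaloniki</div>
<div id="price_101">$50 per night</div>
<div class="title">Not matched: no id</div>
"""

collector = PrefixIdCollector("title")
collector.feed(SAMPLE_HTML)
titles = collector.texts
print(titles)
```

Only the two divs whose id begins with "title" are collected, which is the same rule the CSS selector div[id^='title'] and the XPath starts-with(@id, 'title') apply in the browser.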
    
driver.find_elements("class name", "n1v28t5c s1cjsi4j dir dir-ltr")

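will return 0 elements. By.CLASS_NAME can only locate elements by a single class name.

("n1v28t5c s1cjsi4j dir dir-ltr" is actually 4 separate classes on the element you are trying to locate.) For example, you can use an XPath selector to find elements with multiple classes: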

driver.find_elements(By.XPATH, '//div[@class="n1v28t5c s1cjsi4j dir dir-ltr"]')

This will find all 20 elements on the page. I strongly encourage you to learn more about XPath, as it is easy to understand and very powerful.
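
One caveat with the XPath above: @class="..." compares the whole attribute string, so it stops matching if the classes appear in a different order or an extra class is added. A CSS selector that chains the classes does not have that restriction. As a rough sketch, a space-separated class attribute can be turned into such a selector like this (classes_to_css is a made-up helper name, and it assumes the class names contain no characters that need escaping in CSS):

```python
def classes_to_css(tag, class_attr):
    """Turn 'a b c' into 'tag.a.b.c', a selector that ignores class order."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = classes_to_css("div", "n1v28t5c s1cjsi4j dir dir-ltr")
print(selector)  # div.n1v28t5c.s1cjsi4j.dir.dir-ltr
```

The resulting string can then be used as driver.find_elements(By.CSS_SELECTOR, selector).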
