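I need to create a dataframe with the following columns:
WEB | Country | Organisation
I am extracting this information from a website; however, for some sites the website has no information at all. This causes problems when I update the dataframe. Unfortunately, the code can only handle one site at a time, otherwise a captcha appears. Please see the code below for a single output: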
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

element = []
country = []
organisation = []
x = 'stackoverflow.com'  # or 'livevsfox.ca'; I would suggest trying the first one, then the other one
frame_dict = {}
element.append(x)  # I am keeping this just because I'd like to consider a for loop in future

driver = webdriver.Chrome('path')
driver.get('website/' + x)  # here x should be stackoverflow.com, then the other site
try:
    wait = WebDriverWait(driver, 30)
    driver.execute_script("window.scrollTo(0, 1000)")
    try:
        error = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.selection div.container h2")))  # updated after answer from another post and comment below
    except TimeoutException:
        pass  # no h2 message appeared within the timeout
    # Country
    c = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div"))).text
    country.append(c)
    # Organisation
    try:
        org = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Organisation']/../following-sibling::div"))).text
        organisation.append(org)
    except TimeoutException:
        organisation.append("Data not available")
except TimeoutException:
    pass  # the country was not found, so nothing was appended for this site
finally:
    driver.quit()

# Fails if the lists end up with different lengths
frame_dict.update({'WEB': element, 'Organisation': organisation, 'Country': country})
df = pd.DataFrame.from_dict(frame_dict)
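The code should do the following:
- for x = stackoverflow.com (this is just an example of a working URL): open Chrome; if there is information, extract the organisation and country; if not, add "Missing" to the dataframe; quit Chrome;
- for x = livevsfox.ca: open Chrome; if there is information, extract the organisation and country; if not, add "Missing" to the Organisation and Country columns; quit the browser.
The expected output is: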
WEB | Country | Organisation
stackoverflow.com | US | Stack Exchange, Inc.
livevsfox.ca | Missing | Missing
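The expected output above relies on every column receiving exactly one value per site. A minimal sketch of that idea, assuming the scraped results are first collected in a plain dict (the `build_columns` helper and the `results` structure are hypothetical, not part of the original code):

```python
MISSING = "Missing"


def build_columns(results):
    """Turn {site: {"Country": ..., "Organisation": ...} or None}
    into three lists of equal length, one per dataframe column."""
    frame_dict = {"WEB": [], "Country": [], "Organisation": []}
    for site, info in results.items():
        info = info or {}  # None means nothing was found for this site
        frame_dict["WEB"].append(site)
        # Fall back to the placeholder when a field is absent or empty
        frame_dict["Country"].append(info.get("Country") or MISSING)
        frame_dict["Organisation"].append(info.get("Organisation") or MISSING)
    return frame_dict


# Illustrative input mirroring the two sites from the question
results = {
    "stackoverflow.com": {"Country": "US", "Organisation": "Stack Exchange, Inc."},
    "livevsfox.ca": None,
}
frame_dict = build_columns(results)
```

Because all three lists grow together, the resulting `frame_dict` can be passed to `pd.DataFrame.from_dict(frame_dict)` without a length mismatch.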
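In fact, livevsfox.ca returns the following message:
Sorry, livevsfox.ca could not be found or reached (error code 404)
This message does not appear when I look up stackoverflow.com. Since stackoverflow.com has a country and an organisation, I can add this information to the dataframe, but I cannot do the same for livevsfox.ca. I think a possible solution would be the following: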
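- check whether the h2 class element contains the message above ("Sorry, x could not be found or reached (error code 404)"): this would mean that no information was found for the site;
- if the site has no information, add Missing (or NA, up to you) to the dataframe;
- otherwise, the site has information (Owner & ...) that will be added to the dataframe.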
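The first step of the list above, deciding whether the h2 text is the "not found" notice, can be sketched as a small pure function. This is only an illustration: `has_no_information` is a hypothetical helper, and matching on both "Sorry," and "could not be found or reached" is an assumption that is slightly stricter than checking for "Sorry," alone.

```python
def has_no_information(message):
    """Return True when the h2 text looks like the 'could not be found' notice."""
    text = message.strip()
    # Assumption: the notice always starts with "Sorry," and contains this phrase
    return text.startswith("Sorry,") and "could not be found or reached" in text


# The message reported for livevsfox.ca in the question
notice = "Sorry, livevsfox.ca could not be found or reached (error code 404)"
no_info = has_no_information(notice)
```

When the function returns True, "Missing" would be appended to both the Organisation and Country lists; otherwise the normal extraction would run.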
I hope you can provide some help.
I have found a solution to this problem.
First, I detect the h2 class element as follows:
message = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,"section.section div.container h2"))).text
Then I check whether message contains a specific text; for example,
if 'Sorry,' in message:
If it does, I append the values to my lists, which are then added to the dataframe:
organisation.append('Missing')
country.append('Missing')
Code:
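# Runs inside the per-site loop, so `continue` moves on to the next site
try:
    message = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.section div.container h2"))).text
    if 'Sorry,' in message:
        # Append to both lists so the columns stay the same length
        organisation.append('Missing')
        country.append('Missing')
except Exception:
    continue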
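Putting the pieces together, the whole flow might look like the sketch below: one browser session per site, the h2 check first, then the two field lookups. Several things here are assumptions rather than part of the original answer: the helper names `scrape_site` and `scrape_all`, the `base_url` parameter (standing in for the `'website/'` placeholder), calling `webdriver.Chrome()` without a path (which requires a Selenium setup that can locate the driver itself), and treating a missing h2 as "no error message" instead of skipping the site.

```python
MISSING = "Missing"
H2_SELECTOR = "section.section div.container h2"
FIELD_XPATH = (
    "//div[text()='Company data']/../following-sibling::div"
    "/descendant::b[text()='{label}']/../following-sibling::div"
)


def scrape_site(site, base_url, timeout=30):
    """Open a fresh Chrome session for one site; return (country, organisation)."""
    # Imported here so scrape_all can be tried out without Selenium installed
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Selenium can find the driver itself
    try:
        driver.get(base_url + site)
        wait = WebDriverWait(driver, timeout)
        try:
            message = wait.until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, H2_SELECTOR))
            ).text
        except TimeoutException:
            message = ""  # no h2 appeared; fall through and try the fields
        if "Sorry," in message:
            return MISSING, MISSING
        values = []
        for label in ("Country", "Organisation"):
            try:
                cell = wait.until(
                    EC.visibility_of_element_located(
                        (By.XPATH, FIELD_XPATH.format(label=label))
                    )
                )
                values.append(cell.text)
            except TimeoutException:
                values.append(MISSING)
        return values[0], values[1]
    finally:
        driver.quit()  # always close the browser, even on errors


def scrape_all(sites, scrape_one):
    """Call scrape_one(site) for each site; any failure becomes a 'Missing' row."""
    rows = []
    for site in sites:
        try:
            country, organisation = scrape_one(site)
        except Exception:
            country, organisation = MISSING, MISSING
        rows.append((site, country, organisation))
    return rows
```

A possible use would be `rows = scrape_all(sites, lambda s: scrape_site(s, base_url='website/'))`, followed by `pd.DataFrame(rows, columns=['WEB', 'Country', 'Organisation'])`. Passing the scraper in as a function keeps the loop easy to try with a stub before running it against the real site.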