No data collected when extracting information from a website using XPath



I need to extract information from a website. The site holds the information inside the following markup:

<div class="accordion-block__question">
<div class="accordion-block__text">Server</div></div>
...
<div class="block__col"><b>Country</b></div>

Running

try:
    # Country
    c = driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]").get_attribute('textContent')
    country.append(c)
except:
    country.append("Error")

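creates a df that contains only errors. I am interested in all fields (though a single field is enough to solve this problem), including the Trustscore (a number), but I do not know whether it is possible to get it. I am using Selenium with the Chrome web driver. URL: https://www.scamadviser.com/check-website

This is the whole code: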

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def scam(df):
    chrome_options = webdriver.ChromeOptions()
    trust = []
    country = []
    isp_country = []

    query = df['URL'].unique().tolist()
    driver = webdriver.Chrome('mypath', chrome_options=chrome_options)

    for x in query:

        wait = WebDriverWait(driver, 10)
        response = driver.get('https://www.scamadviser.com/check-website/' + x)

        try:
            wait = WebDriverWait(driver, 30)
            # missing trustscore
            # Country
            c = driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]")).get_attribute('innerText')
            country.append(c)
            # ISP country
            ic = driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'ISP')]").get_attribute('innerText')
            isp_country.append(ic)

        except:
            # missing trustscore
            country.append("Error")
            isp_country.append("Error")

    # Create dataframe
    dict = {'URL': query, 'Trustscore': trust, 'Country': country, 'ISP': isp_country}
    df = pd.DataFrame(dict)
    driver.quit()

    return df

You can try it with df['URL'] =

stackoverflow.com
gitHub.com

You are looking for innerText, not textContent.

Code:

try:
    # Country
    c = driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]").get_attribute('innerText')
    print(c)
    country.append(c)
except:
    country.append("Error")

Update 1:

If the locator you are already using is correct, scroll to the element first:

driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]"))

Alternatively, you can try this XPath as a second option:

//div[contains(@class,'block__col')]/b[text()='Country']
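One likely reason the original locator matches nothing: in the markup from the question, the word Country sits inside the <b> child, so the div has no direct text node for contains(text(),'Country') to test. The sketch below illustrates this offline with Python's standard-library xml.etree.ElementTree on a copy of that fragment. ElementTree supports only a subset of XPath (no contains()), so the class test is an exact match here; this is an illustration under that assumption, not the Selenium call itself.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the question, wrapped in a root element so it parses.
html = """
<root>
  <div class="accordion-block__question">
    <div class="accordion-block__text">Server</div>
  </div>
  <div class="block__col"><b>Country</b></div>
</root>
"""
root = ET.fromstring(html)

col = root.find(".//div[@class='block__col']")

# The div has no text of its own; the label lives in its <b> child.
direct_text = col.text
full_text = "".join(col.itertext())

# Targeting the <b> child directly, as in the XPath above, does match.
label = root.find(".//div[@class='block__col']/b[.='Country']")

print(repr(direct_text))
print(repr(full_text))
print(label is not None)
```

The div's own text is empty while the text gathered from its descendants is the label, which is why a locator that steps into the <b> element is the safer choice.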

Update 2:

Try this:

wait = WebDriverWait(driver, 30)
# missing trustscore
# Country
time.sleep(2)
ele = driver.find_element_by_xpath("//div[contains(@class,'block__col')]/b[text()='Country']")
driver.execute_script("arguments[0].scrollIntoView(true);", ele)
country.append(ele.get_attribute('innerText'))
time.sleep(2)
# ISP country
ic = driver.find_element_by_xpath("//div[contains(@class,'block__col')]/b[text()='ISP']")
# Scroll to the ISP element itself (the earlier version passed ele here by mistake)
driver.execute_script("arguments[0].scrollIntoView(true);", ic)
isp_country.append(ic.get_attribute('innerText'))
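Because the question wraps every lookup in a single try block, one failed locator marks all fields as "Error". A minimal sketch of per-field handling follows. The names collect_fields and getters are hypothetical, and the lambda returns a placeholder string rather than real data; in real use each getter would wrap one of the Selenium lookups above.

```python
from typing import Callable, Dict


def collect_fields(getters: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call each getter independently and record "Error" only for those that fail."""
    row = {}
    for name, getter in getters.items():
        try:
            row[name] = getter()
        except Exception:
            row[name] = "Error"
    return row


def failing_lookup() -> str:
    # Stand-in for a locator that finds nothing.
    raise LookupError("element not found")


row = collect_fields({
    "Country": lambda: "example-country",  # placeholder value, not scraped data
    "ISP": failing_lookup,
})
print(row)
```

Appending one such dict per URL to a list and passing that list to pd.DataFrame keeps every column the same length, even when some lookups fail.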

Update 3:

To get the Country name under Company data,

use this XPath:

//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div
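Since the question asks about all fields, the same expression can be parameterised by section heading and field label. The helper below is a hypothetical sketch (value_xpath is not part of Selenium), and it assumes the other sections of the page follow the same layout as Company data, which has not been verified here.

```python
def value_xpath(section: str, label: str) -> str:
    """Build the XPath for the value cell that follows `label` inside `section`.

    Hypothetical helper: it only substitutes names into the expression above.
    """
    return (
        f"//div[text()='{section}']/../following-sibling::div"
        f"/descendant::b[text()='{label}']/../following-sibling::div"
    )


print(value_xpath("Company data", "Country"))
```

The returned string can be passed wherever the literal XPath is used. A heading or label that contains a single quote would need escaping, which this sketch does not handle.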

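Also, make sure of a few things before using this XPath.

  1. Launch the browser in full-screen mode.
  2. Scroll with JavaScript, then use scrollIntoView or ActionChains.

Code: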

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.maximize_window()
time.sleep(2)
driver.execute_script("window.scrollTo(0, 1000)")
time.sleep(2)
driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']"))))
# now use the mentioned xpath.
company_data_country_name = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div")))
print(company_data_country_name.text)
