I have a list of URLs that I need to scrape data from. When I open each URL in a new driver instance, the website refuses the connection, so I decided to open each URL in a new tab (the website allows this). Below is the code I'm using:
from selenium import webdriver
import time
from lxml import html
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
file = open(r'f:\listofurls.txt', 'r')
for aa in file:
    aa = aa.strip()
    driver.execute_script("window.open('{}');".format(aa))
    soup = html.fromstring(driver.page_source)
    name = soup.xpath('//div[@class="name"]//text()')
    title = soup.xpath('//div[@class="title"]//text()')
    print(name, title)
    time.sleep(3)
But the problem is that all the URLs open at once instead of one after another.
You can try the following code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
file = open(r'f:\listofurls.txt', 'r')
for aa in file:
    aa = aa.strip()
    # Open a new tab
    driver.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 't')
    # You can use (Keys.CONTROL + 't') on other OSs
    # Load the page
    driver.get(aa)
    # Run the scraping...
    soup = html.fromstring(driver.page_source)
    name = soup.xpath('//div[@class="name"]//text()')
    title = soup.xpath('//div[@class="title"]//text()')
    print(name, title)
    time.sleep(3)
driver.close()
I think you have to do the stripping before the loop, like this (note that a file object has no `strip` method, so you strip each line while building the list):
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
file = open(r'f:\listofurls.txt', 'r')
aa = [line.strip() for line in file]
for i in aa:
    driver.execute_script("window.open('{}');".format(i))
    soup = html.fromstring(driver.page_source)
    name = soup.xpath('//div[@class="name"]//text()')
    title = soup.xpath('//div[@class="title"]//text()')
    print(name, title)
    time.sleep(3)
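Either way, the key point is that iterating over a file object yields each line with its trailing newline, which must be stripped before the URL is handed to the browser. A minimal sketch of just that preprocessing step, using a temporary file in place of `f:\listofurls.txt`:

```python
import os
import tempfile

# Write a stand-in URL list (the real code reads f:\listofurls.txt)
path = os.path.join(tempfile.mkdtemp(), 'listofurls.txt')
with open(path, 'w') as f:
    f.write('https://example.com/a\nhttps://example.com/b\n')

# Each line read from a file object carries its trailing '\n',
# so strip it before using the line as a URL
with open(path) as f:
    urls = [line.strip() for line in f]

print(urls)  # ['https://example.com/a', 'https://example.com/b']
```

An unstripped newline inside `window.open('...\n')` can silently break the JavaScript string, which is one reason the stripped version behaves differently.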