Running threads one after another in Python



Update 1:

If I change the code inside the for loop to:

print('processing new page')
pool.apply_async(time.sleep, (5,))

I still see a 5-second delay after each print, so the problem is not related to the webdriver.

Update 2:

Thanks @user56700, but I'd love to understand what I did wrong here, and how to fix it without changing the way I use threads.


In Python, I have the following code:

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
for url in urls:
    try:
        print('processing new page')
        result = parse_page(driver, url)  # Visit url via driver, wait for it to load and parse its contents (takes 30 sec per page)
        # Change global variables
    except Exception as e:
        log_warning(str(e))

If I have 10 pages, the code above takes 300 seconds to finish, which is far too long.

I read about threading in Python here: https://stackoverflow.com/a/15144765/19500354 and I want to use it, but I'm not sure whether I'm doing it the right way.

Here is my attempt:

import threading
from multiprocessing.pool import ThreadPool as Pool

G_LOCK = threading.Lock()
driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
pool = Pool(10)
for url in urls:
    try:
        print('processing new page')
        result = pool.apply_async(parse_page, (driver, url,)).get()
        G_LOCK.acquire()
        # Change global variables
        G_LOCK.release()
    except Exception as e:
        log_warning(str(e))
pool.close()
pool.join()
# Here I want to make sure ALL threads have finished working before running the code below

Why is my implementation wrong? Note: I am using the same driver instance.

I tried printing the time next to processing new page, and I saw:

[10:36:02] processing new page
[10:36:09] processing new page
[10:36:15] processing new page
[10:36:22] processing new page
[10:36:39] processing new page

This means something is wrong, because I expected only about a 1-second difference, since all I do afterwards is change global variables.
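The timestamps are consistent with each pool.apply_async(...).get() call blocking until its task finishes, which serializes the loop. A selenium-free sketch (using time.sleep in place of parse_page; names and delays are illustrative) that reproduces the difference and shows the fix of submitting every task before collecting any result:

```python
import time
from multiprocessing.pool import ThreadPool

def fake_parse(url):
    time.sleep(0.5)  # stand-in for the 30-second page load
    return url

urls = ['a', 'b', 'c']
pool = ThreadPool(3)

# Serialized: .get() blocks until each task finishes, so the next
# apply_async is not submitted until the previous one is done.
start = time.monotonic()
serial = [pool.apply_async(fake_parse, (u,)).get() for u in urls]
serial_elapsed = time.monotonic() - start  # ~1.5 s (3 x 0.5 s)

# Parallel: submit every task first, then collect the results.
start = time.monotonic()
pending = [pool.apply_async(fake_parse, (u,)) for u in urls]
parallel = [p.get() for p in pending]
parallel_elapsed = time.monotonic() - start  # ~0.5 s

pool.close()
pool.join()
print(serial_elapsed, parallel_elapsed)
```

The serialized variant takes roughly the sum of all task durations; the parallel one takes roughly the duration of the slowest task.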

I just created a simple example to show how to solve it. Of course, you will need to add your own code.

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver

driver = webdriver.Chrome()
urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
your_data = []

def parse_page(driver, url):
    driver.get(url)
    data = driver.title
    return data

with ThreadPoolExecutor(max_workers=10) as executor:
    results = {executor.submit(parse_page, driver, url) for url in urls}
    for result in as_completed(results):
        your_data.append(result.result())

driver.close()
print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
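One thing to note about this pattern: as_completed yields futures in the order they finish, not the order they were submitted, so your_data may not line up with urls. A small sketch (time.sleep standing in for page loads; names are illustrative) that shows this:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_parse(url, delay):
    time.sleep(delay)
    return url

with ThreadPoolExecutor(max_workers=3) as executor:
    # The slow task is submitted first, but it still finishes last.
    futures = [executor.submit(fake_parse, 'slow', 0.6),
               executor.submit(fake_parse, 'fast', 0.1)]
    finished = [f.result() for f in as_completed(futures)]

print(finished)  # ['fast', 'slow']
```

If the result order matters to you, iterate over the futures list directly and call .result() on each instead of using as_completed.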

If you want, you can use the webdriver as a context manager to avoid having to close it, like:

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    with ThreadPoolExecutor(max_workers=10) as executor:
        results = {executor.submit(parse_page, driver, url) for url in urls}
        for result in as_completed(results):
            your_data.append(result.result())
    print(your_data)

An example using the multiprocessing.pool library:

from selenium import webdriver
from multiprocessing.pool import ThreadPool

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    pool = ThreadPool(processes=10)
    results = [pool.apply_async(parse_page, (driver, url)) for url in urls]
    for result in results:
        your_data.append(result.get())
    print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
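If you prefer an even more compact form, ThreadPool.map submits everything up front and returns results in input order. A selenium-free sketch (the stub parse_page and the 'drv' placeholder are illustrative; functools.partial binds the shared driver argument so map only varies the url):

```python
from functools import partial
from multiprocessing.pool import ThreadPool

def parse_page(driver, url):
    # Stub standing in for the selenium version above.
    return f'{driver}:{url}'

urls = ['u1', 'u2', 'u3']
with ThreadPool(processes=3) as pool:
    # partial fixes the driver argument; map varies only the url
    # and preserves the order of urls in the result list.
    your_data = pool.map(partial(parse_page, 'drv'), urls)

print(your_data)  # ['drv:u1', 'drv:u2', 'drv:u3']
```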
