Python - Can't get a counter to work in a multiprocessing environment (Pool, map)



I need the counter variable (list_counter) in my "scraper" function to increment on every iteration over list1.

The problem is that each individual process gets its own copy of the counter.

I would like every process to simply increment the global list_counter at the end of its loop, rather than each process keeping a counter of its own.

I tried passing the variable in as an argument, but I couldn't get it to work that way either.

What do you think? Is it possible to get a global counter working across multiple processes - specifically with Pool, map, and Lock?

from multiprocessing import Lock, Pool
from time import sleep
from bs4 import BeautifulSoup
import re
import requests
exceptions = []
lock = Lock()
list_counter = 0

def scraper(url):  # url is tied to the individual list items
    """
    Testing multiprocessing and requests
    """
    global list_counter
    lock.acquire()
    try:
        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
        if scrape.status_code == 200:
            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """
            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', list_counter, '-', url, '-', "Rank:", rank[0])
            list_counter = list_counter + 1
        else:
            print("Server Status:", scrape.status_code)
            list_counter = list_counter + 1
            print(list_counter)
            pass
    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        list_counter = list_counter + 1
        print(list_counter)
        pass
    finally:
        lock.release()
if __name__ == '__main__':
    list1 = ["http://www.wallstreetinvestorplace.com/2018/04/cvs-health-corporation-cvs-to-touch-7-54-earnings-growth-for-next-year/",
             "https://macondaily.com/2018/04/06/cetera-advisors-llc-lowers-position-in-cvs-health-cvs.html",
             "http://www.thesportsbank.net/football/liverpool/jurgen-klopp-very-positive-about-mo-salah-injury/",
             "https://www.moneyjournals.com/trump-wasting-time-trying-bring-amazon/",
             "https://www.pmnewsnigeria.com/2018/04/06/fcta-targets-800000-children-for-polio-immunisation/",
             "http://toronto.citynews.ca/2018/04/06/officials-in-canada-braced-for-another-spike-in-illegal-border-crossings/",
             "https://www.pmnewsnigeria.com/2018/04/04/pdp-describes-looters-list-as-plot-to-divert-attention/",
             "https://beyondpesticides.org/dailynewsblog/2018/04/epa-administrator-pruitt-colluding-regulated-industry/",
             "http://thyblackman.com/2018/04/06/robert-mueller-is-searching-for/",
             "https://www.theroar.com.au/2018/04/06/2018-commonwealth-games-swimming-night-2-finals-live-updates-results-blog/",
             "https://medicalresearch.com/pain-research/migraine-linked-to-increased-risk-of-heart-disease-and-stroke/40858/",
             "http://www.investingbizz.com/2018/04/amazon-com-inc-amzn-stock-creates-investors-concerns/",
             "https://stocknewstimes.com/2018/04/06/convergence-investment-partners-llc-grows-position-in-amazon-com-inc-amzn.html",
             "https://factsherald.com/old-food-rules-needs-to-be-updated/",
             "https://www.nextadvisor.com/blog/2018/04/06/the-facebook-scandal-evolves/",
             "http://sacramento.cbslocal.com/2018/04/04/police-family-youtube-shooter/",
             "http://en.brinkwire.com/245768/why-does-stress-lead-to-weight-gain-study-sheds-light/",
             "https://www.marijuana.com/news/2018/04/monterey-bud-jeff-sessions-is-on-the-wrong-side-of-history-science-and-public-opinion/",
             "http://www.stocksgallery.com/2018/04/06/jpmorgan-chase-co-jpm-noted-a-price-change-of-0-80-and-amazon-com-inc-amzn-closes-with-a-move-of-2-92/",
             "https://stocknewstimes.com/2018/04/06/front-barnett-associates-llc-has-2-41-million-position-in-cvs-health-corp-cvs.html",
             "http://www.liveinsurancenews.com/colorado-mental-health-insurance-bill-to-help-consumers-navigate-the-system/",
             "http://newyork.cbslocal.com/2018/04/04/youtube-headquarters-shooting-suspect/",
             "https://ledgergazette.com/2018/04/06/liberty-interactive-co-series-a-liberty-ventures-lvnta-shares-bought-by-brandywine-global-investment-management-llc.html",
             "http://bangaloreweekly.com/2018-04-06-city-holding-co-invests-in-cvs-health-corporation-cvs-shares/",
             "https://www.thenewsguru.com/didnt-know-lawyer-paid-prostitute-130000-donald-trump/",
             "http://www.westlondonsport.com/chelsea/football-wls-conte-gives-two-main-reasons-chelseas-loss-tottenham",
             "https://registrarjournal.com/2018/04/06/amazon-com-inc-amzn-shares-bought-by-lenox-wealth-management-inc.html",
             "http://www.businessdayonline.com/1bn-eca-withdrawal-commence-action-president-buhari-pdp-tasks-nass/",
             "http://www.thesportsbank.net/football/manchester-united/pep-guardiola-asks-for-his-fans-help-vs-united-in-manchester-derby/",
             "https://www.pakistantoday.com.pk/2018/04/06/three-palestinians-martyred-as-new-clashes-erupt-along-gaza-border/",
             "http://www.nasdaqfortune.com/2018/04/06/risky-factor-of-cvs-health-corporation-cvs-is-observed-at-1-03/",
             "https://stocknewstimes.com/2018/04/06/cetera-advisor-networks-llc-decreases-position-in-cvs-health-cvs.html",
             "http://nasdaqjournal.com/index.php/2018/04/06/planet-fitness-inc-nyseplnt-do-analysts-think-you-should-buy/",
             "http://www.tv360nigeria.com/apc-to-hold-national-congress/",
             "https://www.pmnewsnigeria.com/2018/04/03/apc-governors-keep-sealed-lips-after-meeting-with-buhari/",
             "https://www.healththoroughfare.com/diet/healthy-lifestyle-best-foods-you-should-eat-for-weight-loss/7061",
             "https://stocknewstimes.com/2018/04/05/amazon-com-inc-amzn-shares-bought-by-west-oak-capital-llc.html",
             "http://www.current-movie-reviews.com/48428/dr-oz-could-you-be-a-victim-of-sexual-assault-while-on-vacation/",
             "https://www.brecorder.com/2018/04/07/410124/world-health-day-to-be-observed-on-april-7/",
             "http://www.coloradoindependent.com/169637/trump-pruitt-emissions-epa-pollution",
             "https://thecrimereport.org/2018/04/05/will-sessions-new-justice-strategy-turn-the-clock-back-on-civil-rights/",
             "http://en.brinkwire.com/245490/pasta-unlikely-to-cause-weight-gain-as-part-of-a-healthy-diet/"]
    p = Pool(15)  # number of worker processes
    p.map(scraper, list1)  # (function, iterable)
    p.close()
    p.join()

You can use concurrent.futures:

import concurrent.futures
import urllib.request
from time import sleep
from bs4 import BeautifulSoup
import re
import requests
exceptions = []  # needed by the except block in scraper below

def scraper(url):
    list_counter = 0
    try:
        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
        if scrape.status_code == 200:
            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', list_counter, '-', url, '-', "Rank:", rank[0])
            list_counter = list_counter + 1
        else:
            print("Server Status:", scrape.status_code)
            list_counter = list_counter + 1
            print(list_counter)
            pass
    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        list_counter = list_counter + 1
        print(list_counter)
        pass
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

list1: copy your list from above here (omitted to save space)

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    future_to_url = {executor.submit(load_url, url, 50): url for url in list1}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
with concurrent.futures.ProcessPoolExecutor() as executor:
    for n, p in zip(list1, executor.map(scraper, list1)):
        print(n, p)

You will get output like this (only a few lines shown):

http://www.coloradoindependent.com/169637/trump-pruitt-emissions-epa-pollution None
Server Status: 200 - ✓ - 0 - https://thecrimereport.org/2018/04/05/will-sessions-new-justice-strategy-turn-the-clock-back-on-civil-rights/ - Rank: 381576
https://thecrimereport.org/2018/04/05/will-sessions-new-justice-strategy-turn-the-clock-back-on-civil-rights/ None
Server Status: 200 - ✓ - 0 - http://en.brinkwire.com/245490/pasta-unlikely-to-cause-weight-gain-as-part-of-a-healthy-diet/ - Rank: 152818
http://en.brinkwire.com/245490/pasta-unlikely-to-cause-weight-gain-as-part-of-a-healthy-diet/ None

Processes do not share memory with one another. However, you can use a Manager from the multiprocessing module so that the processes can all operate on the same object:

manager = multiprocessing.Manager()
list_counter = manager.list()

You have to pass list_counter into the scraper function. Note that the list created by the manager is thread/process-safe.
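
For example, here is a minimal sketch of that idea, with the actual requests/BeautifulSoup work stripped out of scraper; the shared list is handed to the workers through functools.partial, which works with Pool.map because manager proxies can be pickled:

from functools import partial
from multiprocessing import Manager, Pool

def scraper(url, list_counter):
    # ... the requests / BeautifulSoup work from above would go here ...
    list_counter.append(url)            # updates to the proxy list are process-safe
    print(len(list_counter), '-', url)  # length of the shared list is the running count

if __name__ == '__main__':
    list1 = ["http://example.com"]      # replace with your list of URLs
    manager = Manager()
    list_counter = manager.list()       # shared list living in the manager's server process
    with Pool(15) as p:
        p.map(partial(scraper, list_counter=list_counter), list1)

The printed count can lag by a step when several workers finish at almost the same time, but len(list_counter) at the end equals the number of URLs processed.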
