How to safely multithread scraping over a list of URLs



I'm scraping multiple URLs from a list.

It seems to work, but the output is all mixed up and the rows don't correspond to each other.

Here is the code with threading:

import requests
import pandas
import json
import concurrent.futures
# our list with multiple profiles
profile=['kaid_329989584305166460858587','kaid_896965538702696832878421','kaid_1016087245179855929335360','kaid_107978685698667673890057','kaid_797178279095652336786972','kaid_1071597544417993409487377','kaid_635504323514339937071278','kaid_415838303653268882671828','kaid_176050803424226087137783']
# two lists of the data that we are going to fill up with each profile
link=[]
projects=[]
############### SCRAPING PART ###############
# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')
# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]
############### WRITING PART ##############
# Now we write everything into a dataframe object
d = {'link':link,'projects':projects}
dataframe = pandas.DataFrame(data=d)
print(dataframe)

I expected this (the output I get without threading):

                                                link projects
0  https://www.khanacademy.org/profile/kaid_32998...        0
1  https://www.khanacademy.org/profile/kaid_89696...      219
2  https://www.khanacademy.org/profile/kaid_10160...       22
3  https://www.khanacademy.org/profile/kaid_10797...        0
4  https://www.khanacademy.org/profile/kaid_79717...        0
5  https://www.khanacademy.org/profile/kaid_10715...       12
6  https://www.khanacademy.org/profile/kaid_63550...      365
7  https://www.khanacademy.org/profile/kaid_41583...       NA
8  https://www.khanacademy.org/profile/kaid_17605...        2

But instead I get this:

                                                link projects
0  https://www.khanacademy.org/profile/kaid_32998...        0
1  https://www.khanacademy.org/profile/kaid_89696...        0
2  https://www.khanacademy.org/profile/kaid_10160...        0
3  https://www.khanacademy.org/profile/kaid_10797...       22
4  https://www.khanacademy.org/profile/kaid_79717...       NA
5  https://www.khanacademy.org/profile/kaid_10715...       12
6  https://www.khanacademy.org/profile/kaid_63550...        2
7  https://www.khanacademy.org/profile/kaid_41583...      219
8  https://www.khanacademy.org/profile/kaid_17605...      365

It looks similar, but we can actually see that our link values don't correctly correspond to our projects values. They're mixed up.

My code without threading is identical, except for the SCRAPING PART:
# first part of the scraping
for kaid in profile:
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))
# second part of the scraping
for kaid in profile:
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data=data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')

What is wrong with my threaded code? Why does everything get mixed up?
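For context on where the mix-up comes from: `concurrent.futures` runs the submitted calls concurrently, and the worker threads finish in completion order, not submission order, so each thread appends to the shared lists whenever it happens to finish. A minimal standalone demo (with made-up sleep delays standing in for network latency) sketching this:

```python
import concurrent.futures
import time

def worker(delay):
    # stand-in for a slow requests.get(...) call
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]
completed = []
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(worker, d) for d in delays]
    # as_completed yields futures as they finish, not as they were submitted
    for future in concurrent.futures.as_completed(futures):
        completed.append(future.result())

print(completed)  # completion order: [0.1, 0.2, 0.3], not [0.3, 0.1, 0.2]
```

Because each `scraper` call appends to `link` immediately but only appends to `projects` after its network request returns, the two lists fill up in different orders.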

Try something like this? Instead of appending to link first and then, after some intervening code has run, appending to projects, append both of them together right away; that should fix the problem. But I'm thinking of a better way ATM...

d = {'link' : [], 'projects' : []}
############### SCRAPING PART ###############
# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    d['link'].append(link)
    d['projects'].append(projects)
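A caveat on this first approach (my own note, not part of the original answer): even with the two appends adjacent, a thread can still be scheduled between them, so the pairing is not strictly guaranteed. A minimal sketch that makes the pair atomic with a `threading.Lock`; here `scrape_one` is a hypothetical stand-in for the request/parse logic:

```python
import threading
import concurrent.futures

d = {'link': [], 'projects': []}
d_lock = threading.Lock()

def scrape_one(kaid):
    # hypothetical stand-in for requests.get(...) + JSON parsing;
    # returns a (link, projects) pair derived from the kaid
    return ('https://www.khanacademy.org/profile/{}'.format(kaid),
            kaid.split('_')[1])

def scraper(kaid):
    link, projects = scrape_one(kaid)
    with d_lock:  # no other thread can interleave between the two appends
        d['link'].append(link)
        d['projects'].append(projects)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for kaid in ['kaid_a', 'kaid_b', 'kaid_c']:
        executor.submit(scraper, kaid)
# exiting the with-block waits for all submitted tasks to finish
```

The rows stay paired, though row order still follows completion order rather than the order of `profile`.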

A different solution (not really)

Or better:

def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    return link, projects
# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]
        data = future.result()
        link.append(data[0])
        projects.append(data[1])

I'd say the second one is the better solution, because it waits for all the data before processing everything into the dataframe. With the first one there is still a chance of misalignment (it's very slim, since we're talking about ticks of difference at gigahertz clock speeds, but just to eliminate that chance entirely, the second option is better).
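As a further alternative (not in the original answer): `Executor.map` yields results in the order of its input iterable, regardless of which thread finishes first, so the rows line up with `profile` automatically and there's no need to track futures at all. A minimal sketch with hypothetical kaids and a stub `scraper` in place of the real request/parse logic:

```python
import concurrent.futures

def scraper(kaid):
    # stand-in for the real request/parse logic; returns (link, projects)
    return 'https://www.khanacademy.org/profile/{}'.format(kaid), 'NA'

profile = ['kaid_a', 'kaid_b', 'kaid_c']
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() returns results in the same order as `profile`,
    # even though the calls run concurrently
    results = list(executor.map(scraper, profile))

link = [r[0] for r in results]
projects = [r[1] for r in results]
```

With the real `scraper` returning `(link, projects)` as in the second solution above, these two lists can be fed straight into the `pandas.DataFrame` at the end.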
