I am scraping multiple URLs from a list. It seems to work, but the output is all mixed up and the rows don't correspond to each other. Here is the code with threading:
import requests
import pandas
import json
import concurrent.futures

# our list with multiple profiles
profile = ['kaid_329989584305166460858587', 'kaid_896965538702696832878421', 'kaid_1016087245179855929335360', 'kaid_107978685698667673890057', 'kaid_797178279095652336786972', 'kaid_1071597544417993409487377', 'kaid_635504323514339937071278', 'kaid_415838303653268882671828', 'kaid_176050803424226087137783']

# two lists of the data that we are going to fill up with each profile
link = []
projects = []

############### SCRAPING PART ###############
# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')

# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]

############### WRITING PART ##############
# Now we write everything into a dataframe object
d = {'link': link, 'projects': projects}
dataframe = pandas.DataFrame(data=d)
print(dataframe)
I was expecting this (the output I get without threading):
link projects
0 https://www.khanacademy.org/profile/kaid_32998... 0
1 https://www.khanacademy.org/profile/kaid_89696... 219
2 https://www.khanacademy.org/profile/kaid_10160... 22
3 https://www.khanacademy.org/profile/kaid_10797... 0
4 https://www.khanacademy.org/profile/kaid_79717... 0
5 https://www.khanacademy.org/profile/kaid_10715... 12
6 https://www.khanacademy.org/profile/kaid_63550... 365
7 https://www.khanacademy.org/profile/kaid_41583... NA
8 https://www.khanacademy.org/profile/kaid_17605... 2
However, I get this:
link projects
0 https://www.khanacademy.org/profile/kaid_32998... 0
1 https://www.khanacademy.org/profile/kaid_89696... 0
2 https://www.khanacademy.org/profile/kaid_10160... 0
3 https://www.khanacademy.org/profile/kaid_10797... 22
4 https://www.khanacademy.org/profile/kaid_79717... NA
5 https://www.khanacademy.org/profile/kaid_10715... 12
6 https://www.khanacademy.org/profile/kaid_63550... 2
7 https://www.khanacademy.org/profile/kaid_41583... 219
8 https://www.khanacademy.org/profile/kaid_17605... 365
It looks similar, but we can see that our link and our projects don't correspond correctly. They are mixed up. The code without threading is identical except for the SCRAPING PART:
# first part of the scraping
for kaid in profile:
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))

# second part of the scraping
for kaid in profile:
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')
What is wrong with my threaded code? Why does it get mixed up?
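The mix-up can be reproduced without any network calls at all. In the sketch below (all names are made up for the demo), each worker appends to two shared lists with a random sleep in between, the way scraper() appends to link before the request and to projects after it:

```python
import concurrent.futures
import random
import time

link = []
projects = []

def worker(i):
    link.append(i)                       # first append, at task start
    time.sleep(random.uniform(0, 0.01))  # stand-in for the request latency
    projects.append(i)                   # second append, possibly much later

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for i in range(9):
        executor.submit(worker, i)

# Both lists end up holding every item, but generally not in the same
# order, so zipping them pairs the wrong values together.
print(list(zip(link, projects)))
```

link fills up roughly in submission order, while projects fills up in completion order, and the two orders differ whenever one request finishes faster than an earlier one.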
Try something like this? Instead of appending to link and then appending to projects after some code has run in between, append them both one right after the other, which should fix the problem. But I am thinking of a better way ATM...
d = {'link': [], 'projects': []}

############### SCRAPING PART ###############
# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    d['link'].append(link)
    d['projects'].append(projects)
A different solution (not really), or better:
def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    return link, projects

# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]
        data = future.result()
        link.append(data[0])
        projects.append(data[1])
I would say the second one is the better solution, because it waits for all of a task's data before anything is appended for the dataframe. With the first one there is still a chance of misalignment, since another thread can run between the two appends. That chance is very slim (we are talking about a few ticks at gigahertz clock speeds), but to eliminate it completely the second option is better.