How do I use multiprocessing correctly? Invalid URL 'h': No schema supplied



I am trying to scrape information from a fair number of links: first I collect the team links (20), then the player links (550). I am trying to speed the process up with multiprocessing, but I have no experience with it, and when I try to run my code I get the following error:

Traceback (most recent call last):
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "scrape.py", line 50, in playerlinks
squadPage = requests.get(teamLinks[i])
File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 519, in request
prep = self.prepare_request(req)
File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 462, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 313, in prepare
self.prepare_url(url, params)
File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 387, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scrape.py", line 94, in <module>
records = p.map(playerlinks, team)
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

I don't understand why, because all the links start with http://. Why doesn't the multiprocessing run correctly? The full code is below.

from lxml import html
import requests
import pandas as pandas
import numpy as numpy
import re
from multiprocessing import Pool

#Take site and structure html
page = requests.get('https://www.premierleague.com/clubs')
tree = html.fromstring(page.content)

def teamlinks():
    #Using the page's CSS classes, extract all links pointing to a team
    linkLocation = tree.cssselect('.indexItem')
    #Create an empty list for us to send each team's link to
    teamLinks = []
    #For each link...
    for i in range(0,20):
        #...Find the page the link is going to...
        temp = linkLocation[i].attrib['href']
        #...Add the link to the website domain...
        temp = "http://www.premierleague.com/" + temp
        #...Change the link text so that it points to the squad list, not the page overview...
        temp = temp.replace("overview", "squad")
        #...Add the finished link to our teamLinks list...
        teamLinks.append(temp)
    return teamLinks

#Create empty lists for player links
playerLink1 = []
playerLink2 = []

def playerlinks(teamLinks):
    #For each team link page...
    for i in range(len(teamLinks)):
        #...Download the team page and process the html code...
        squadPage = requests.get(teamLinks[i])
        squadTree = html.fromstring(squadPage.content)
        #...Extract the player links...
        playerLocation = squadTree.cssselect('.playerOverviewCard')
        #...For each player link within the team page...
        for i in range(len(playerLocation)):
            #...Save the link, complete with domain...
            playerLink1.append("http://www.premierleague.com/" + playerLocation[i].attrib['href'])
            #...For the second link, change the page from player overview to stats
            playerLink2.append(playerLink1[i].replace("overview", "stats"))
    return playerLink1, playerLink2

def position():
    #Create lists for position
    Position = []
    #Populate list with each position
    #For each player...
    for i in range(len(playerLink1)):
        #...download and process the one page collected earlier...
        playerPage1 = requests.get(playerLink1[i])
        playerTree1 = html.fromstring(playerPage1.content)
        #...find the relevant datapoint for position...
        try:
            tempName = str(playerTree1.cssselect('div.info')[7].text_content())
        except IndexError:
            tempTeam = str("NaN")
        Position.append(tempName)
    return Position

if __name__ == '__main__':
    team = teamlinks()
    p = Pool()  # Pool tells how many at a time
    records = p.map(playerlinks, team)
    p.terminate()
    p.join()

With p.map(playerlinks, team), Python tries to apply the function playerlinks to each individual element of team.

However, as you have defined it, your playerlinks function is designed to operate on the whole list at once. Do you see the problem?
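As a generic, standalone illustration (the square function here is made up for the demo, not part of your scraper), Pool.map calls the function once per element of the iterable:

```python
from multiprocessing import Pool

def square(x):
    # receives ONE element per call, never the whole list
    return x * x

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The function you hand to map must therefore accept a single element, and map collects the per-element return values back into a list, in order.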

This is what your team variable contains -

['http://www.premierleague.com//clubs/1/Arsenal/squad',
'http://www.premierleague.com//clubs/2/Aston-Villa/squad',
'http://www.premierleague.com//clubs/127/Bournemouth/squad',
'http://www.premierleague.com//clubs/131/Brighton-and-Hove-Albion/squad',
'http://www.premierleague.com//clubs/43/Burnley/squad',
'http://www.premierleague.com//clubs/4/Chelsea/squad',
'http://www.premierleague.com//clubs/6/Crystal-Palace/squad',
'http://www.premierleague.com//clubs/7/Everton/squad',
'http://www.premierleague.com//clubs/26/Leicester-City/squad',
'http://www.premierleague.com//clubs/10/Liverpool/squad',
'http://www.premierleague.com//clubs/11/Manchester-City/squad',
'http://www.premierleague.com//clubs/12/Manchester-United/squad',
'http://www.premierleague.com//clubs/23/Newcastle-United/squad',
'http://www.premierleague.com//clubs/14/Norwich-City/squad',
'http://www.premierleague.com//clubs/18/Sheffield-United/squad',
'http://www.premierleague.com//clubs/20/Southampton/squad',
'http://www.premierleague.com//clubs/21/Tottenham-Hotspur/squad',
'http://www.premierleague.com//clubs/33/Watford/squad',
'http://www.premierleague.com//clubs/25/West-Ham-United/squad',
'http://www.premierleague.com//clubs/38/Wolverhampton-Wanderers/squad'] 

The multiprocessing library will try to schedule

playerlinks('http://www.premierleague.com//clubs/1/Arsenal/squad')
playerlinks('http://www.premierleague.com//clubs/2/Aston-Villa/squad')....

across however many cores you have.

Note that each call receives a single URL string, not a list. playerlinks('http://www.premierleague.com//clubs/1/Arsenal/squad') is the call that raises the error.
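That first call is also where the cryptic 'h' comes from: inside the function, teamLinks[i] indexes the URL string character by character, as this standalone demo shows:

```python
url = 'http://www.premierleague.com//clubs/1/Arsenal/squad'

# indexing a string yields single characters, not list elements
print(url[0])  # h

# so requests.get(url[0]) sees the bare string 'h' and raises
# MissingSchema: Invalid URL 'h': No schema supplied.
```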

Modify your playerlinks function so that it operates on a single element of the team variable, and you will see this problem disappear.

Try something like this -

def playerlinks_atomic(teamLink):
    #teamLink is now a single URL string, so pass it to requests directly
    squadPage = requests.get(teamLink)
    squadTree = html.fromstring(squadPage.content)
    #...Extract the player links...
    playerLocation = squadTree.cssselect('.playerOverviewCard')
    #...For each player link within the team page...
    for i in range(len(playerLocation)):
        #...Save the link, complete with domain...
        link = "http://www.premierleague.com/" + playerLocation[i].attrib['href']
        playerLink1.append(link)
        #...For the second link, change the page from player overview to stats
        #(derive it from the link just built, not from playerLink1[i], which
        #points at the wrong entry once the list holds earlier teams' players)
        playerLink2.append(link.replace("overview", "stats"))
    return playerLink1, playerLink2
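One caveat worth adding: each worker process gets its own copy of the module-level playerLink1/playerLink2 globals, so appends made in workers never reach the parent; the parent only sees what the function returns. A safer pattern is to collect and flatten the returned per-team results. This is a sketch with a stand-in fetch function (fake_playerlinks is invented for the demo, since the real function needs network access):

```python
from itertools import chain
from multiprocessing import Pool

def fake_playerlinks(team_link):
    # stand-in for the real scraper: pretend each team page yields two players
    overview = [team_link + "/player1", team_link + "/player2"]
    stats = [link.replace("squad", "stats") for link in overview]
    return overview, stats

if __name__ == '__main__':
    teams = ["http://example.com/clubs/1/Arsenal/squad",
             "http://example.com/clubs/2/Aston-Villa/squad"]
    with Pool(2) as p:
        per_team = p.map(fake_playerlinks, teams)  # one (overview, stats) pair per team
    # flatten the per-team pairs back into two flat lists in the parent process
    playerLink1 = list(chain.from_iterable(pair[0] for pair in per_team))
    playerLink2 = list(chain.from_iterable(pair[1] for pair in per_team))
    print(len(playerLink1), len(playerLink2))  # 4 4
```

Returning results instead of mutating globals also makes the worker function easy to test on its own, without multiprocessing involved at all.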
