I've written a script in Python using the multiprocessing library to scrape certain fields from a webpage. Since I don't really know how to use multiprocessing, I get an error when executing the following script:
import requests
from lxml.html import fromstring
from multiprocessing import Process

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        street = title.cssselect("span.street-address")[0].text
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(4)]
    p = Process(target=create_links, args=(links,))
    p.start()
    p.join()
The error I'm getting:
722, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
I get this error because the script treats the whole list of links as a single link, yet I know I'm supposed to pass the list of links via args=(links,). How can I run it successfully?
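For context on why the traceback points at get_adapter: requests coerces whatever it is given into a URL string, so the list becomes something like "['https://…', …]", which no connection adapter matches. A minimal, hedged reproduction outside the scraping code (with a recent requests version this should fail before any request is actually sent):

import requests

urls = ["https://www.yellowpages.com/search?page=0",
        "https://www.yellowpages.com/search?page=1"]

try:
    # Passing the whole list where a single URL string is expected:
    requests.get(urls)
except requests.exceptions.InvalidSchema as exc:
    print(exc)  # No connection adapters were found for "['https://...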
This works with Pool:
import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        street = title.cssselect("span.street-address")[0].text
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

links = [link.format(page) for page in range(4)]

def main():
    with Pool(4) as p:
        print(p.map(create_links, links))

if __name__ == '__main__':
    main()
Output:
Caffe Latte 6254 Wilshire Blvd (323) 936-5213
Bourgeois Pig 5931 Franklin Ave (323) 464-6008
Beard Papa Sweet Cafe 6801 Hollywood Blvd Ste 157 (323) 462-6100
Intelligentsia Coffee 3922 W Sunset Blvd (323) 663-6173
The Downbeat Cafe 1202 N Alvarado St (213) 483-3955
Sabor Y Cultura 5625 Hollywood Blvd (323) 466-0481
The Wood Cafe 12000 Washington Pl (310) 915-9663
Groundwork Coffee Inc 1501 N Cahuenga Blvd (323) 871-0143
The Apple Pan 10801 W Pico Blvd (310) 475-3585
Good Microbrew & Grill 3725 W Sunset Blvd (323) 660-3645
The Standard Hollywood 8300 W Sunset Blvd (323) 650-9090
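Note that create_links() prints each row and returns nothing, so the final print(p.map(...)) above also evaluates to a list of None values. If you would rather collect the scraped rows in the parent process, a sketch along these lines (same page structure assumed, rows returned instead of printed) should work:

import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    # Return the rows for one results page instead of printing them.
    rows = []
    tree = fromstring(requests.get(url).text)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        street = title.cssselect("span.street-address")[0].text
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        rows.append((name, street, phone))
    return rows

if __name__ == '__main__':
    links = [link.format(page) for page in range(4)]
    with Pool(4) as p:
        # Pool.map() preserves input order, so results line up with the pages.
        results = [row for page_rows in p.map(create_links, links) for row in page_rows]
    for name, street, phone in results:
        print(name, street, phone)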
You can use Pool from multiprocessing:
from multiprocessing import Pool
and dispatch the jobs like this:
links = [link.format(page) for page in range(4)]
p = Pool(10)  # number of worker processes running at a time
results = p.map(create_links, links)
p.close()     # no more tasks; workers exit once they finish
p.join()
If you want to stick with Process, then the following should work:
import requests
from lxml.html import fromstring
from multiprocessing import Process

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    items = []
    for links in [link.format(page) for page in range(1, 6)]:
        p = Process(target=create_links, args=(links,))
        items.append(p)
        p.start()
    for process in items:
        process.join()
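One caveat with this approach: it starts one process per page with no upper bound, which gets heavy as the page range grows. If you want to keep using Process rather than Pool, a rough sketch that caps concurrency by running the URLs in fixed-size batches (batch size chosen arbitrarily; reuses create_links() and link from the snippet above) could look like this:

from multiprocessing import Process

BATCH_SIZE = 4  # arbitrary cap on how many processes run at once

if __name__ == '__main__':
    urls = [link.format(page) for page in range(1, 21)]
    for start in range(0, len(urls), BATCH_SIZE):
        batch = [Process(target=create_links, args=(url,))
                 for url in urls[start:start + BATCH_SIZE]]
        for p in batch:
            p.start()
        # Wait for the current batch to finish before launching the next one.
        for p in batch:
            p.join()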