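I wrote a web-scraping script that scans every page of a job portal and reports the job offers that meet the salary requirement passed to the function. The fields that matter to me are job title, employer, salary, and link. I'm currently using the getText() method, but it returns the text of every element in the row. The result looks like: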
Zubný lekár/lekárka DENTAL CARE Dr. Rosa, s. r. o.Námestie sv. Františka, Karlova Ves
Od 4 500 EUR/mesiac
Pridané Pred 4 dňami Pridať k vybraným
https://www.profesia.sk/praca/dental-care-dr-rosa/O3863429
https://www.profesia.sk/praca/dental-care-dr-rosa/O3863429
Head of Core Technology DevelopmentESET, spol. s r.o.Bratislava
4 500 EUR/mesiac
Pridané pred 2 týždňami Pridať k vybraným
https://www.profesia.sk/praca/eset/C22141
https://www.profesia.sk/praca/eset/O3933805
https://www.profesia.sk/praca/eset/O3933805
It picks up two items I don't need and duplicates the links (because each row contains 2 to 3 <a href> links). Is there a better way?
def search4job(salary):
    import bs4, requests, re
    # Classes -> employer:  class='employer'
    #         -> salary:    ".label"
    #         -> job title: class='title'
    #         -> TODO: link
    base_url = 'https://www.profesia.sk/praca/bratislava/plny-uvazok/?languages=73&page_num={}'
    page = 1  # start from page 1
    request = requests.get(base_url.format(page))  # fetch the complete URL
    HTML = bs4.BeautifulSoup(request.text, 'lxml')
    pattern = r'(\d\s\d\d\d)'  # salary pattern, e.g. "4 500"
    while len(HTML.select(".list-row")) > 0:
        # on pages without job offers the length of list-row is 0; iterate until there are no job offers
        # iterate within the page, print job details
        for i in HTML.select(".list-row"):
            # only give a result when a salary is shown
            if i.find('span', {'class': 'label-group'}):
                try:
                    # only give a result if the salary is at least the one I want
                    if int(str(re.search(pattern, str(i.find('span', {'class': 'label-group'}))).group()).replace(" ", "")) >= salary:
                        # print job details
                        print(i.getText())
                        # print job offer link
                        try:
                            for link in i.findAll('a', attrs={'href': re.compile("/praca/")}):
                                print('https://www.profesia.sk' + str(link.get('href')))
                        except Exception:
                            print("There is an error")
                        print('\n')  # blank line between job offers
                except Exception:
                    # no salary matched the pattern, or it could not be converted to int
                    pass
        # move on to the next page
        page += 1
        request = requests.get(base_url.format(page))
        HTML = bs4.BeautifulSoup(request.text, 'lxml')
        # repeat while there are job offers

search4job(4000)
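One way to get rid of the duplicated links, sketched under an assumption drawn only from the sample output above: offer URLs end in "/O" followed by digits, while the extra link (for example ".../eset/C22141") points to a company page. The helper name unique_offer_links and the OFFER_HREF pattern are hypothetical, not part of the site or of BeautifulSoup.

```python
import re

# Assumption: offer links look like "/praca/<employer>/O<digits>", as in the
# sample output; anything else under /praca/ (e.g. ".../C22141") is skipped.
OFFER_HREF = re.compile(r"^/praca/[^/]+/O\d+")


def unique_offer_links(hrefs, prefix="https://www.profesia.sk"):
    """Return absolute offer URLs, de-duplicated, in first-seen order."""
    offers = (h for h in hrefs if h and OFFER_HREF.match(h))
    # dict.fromkeys keeps insertion order and drops repeated keys
    return [prefix + h for h in dict.fromkeys(offers)]


hrefs = ["/praca/eset/C22141", "/praca/eset/O3933805", "/praca/eset/O3933805"]
print(unique_offer_links(hrefs))
```

Inside the loop, the list of hrefs could be built with [a.get('href') for a in i.findAll('a')] and passed to this helper instead of printing each link directly.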
You can use this script to scrape the data from Profesia:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.profesia.sk/praca/bratislava/plny-uvazok/?languages=73&page_num={}'

page = 1
while True:
    print(base_url.format(page))
    soup = BeautifulSoup(requests.get(base_url.format(page)).content, 'html.parser')
    # each job title on the listing page is a link inside an <h2>
    links = soup.select('h2 > a')
    if not links:
        break  # no more offers: stop paging
    for l in links:
        # open the offer's detail page and read the individual fields
        soup = BeautifulSoup(requests.get('https://www.profesia.sk' + l['href']).content, 'html.parser')
        job_title = soup.h1.text
        employer = soup.select_one('[itemprop="hiringOrganization"]')
        employer = employer.text if employer else '-'
        salary = soup.select_one('span[class^="salary"]')
        salary = salary.text if salary else '-'
        print(job_title)
        print(employer)
        print(salary)
        print('https://www.profesia.sk' + l['href'])
        print('-' * 80)
    page += 1
Prints:
https://www.profesia.sk/praca/bratislava/plny-uvazok/?languages=73&page_num=1
Front Office Manager - AC Hotel by Marriott Bratislava Old Town
Legendhotels Slovakia, s.r.o.
From 1 800 EUR/month
https://www.profesia.sk/praca/legendhotels-slovakia/O3955894
--------------------------------------------------------------------------------
Catalog Quality Associate - Polish & Spanish
Amazon /Slovakia/ s.r.o.
1 150 EUR/month
https://www.profesia.sk/praca/amazon-slovakia/O3937464
--------------------------------------------------------------------------------
Financial Manager - AC Hotel by Marriott Bratislava Old Town
Legendhotels Slovakia, s.r.o.
From 2 200 EUR/month
https://www.profesia.sk/praca/legendhotels-slovakia/O3955853
--------------------------------------------------------------------------------
...and so on.
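The script above prints every offer. To keep only those that meet a salary threshold, as the question asks, the salary text could be turned into a number first. This is a minimal sketch, assuming the salary strings look like the ones printed above; parse_salary and meets_salary are hypothetical helper names, and only the first amount in the string is used.

```python
import re


def parse_salary(text):
    """Return the first amount in a salary string as an int, or None."""
    # A digit followed by further digits or whitespace, e.g. "1 800 ".
    # \s also matches non-breaking spaces, which web pages often use
    # as the thousands separator.
    match = re.search(r"\d[\d\s]*", text or "")
    if not match:
        return None
    return int(re.sub(r"\s", "", match.group()))


def meets_salary(text, minimum):
    """True if the salary text contains an amount of at least `minimum`."""
    amount = parse_salary(text)
    return amount is not None and amount >= minimum


for s in ["From 1 800 EUR/month", "1 150 EUR/month", "From 2 200 EUR/month", "-"]:
    print(s, "->", meets_salary(s, 1800))
```

In the loop, the print calls could then be wrapped in `if meets_salary(salary, 4000):`. Hourly rates or amounts with decimals would need extra handling, since only the leading whole number is read.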