BeautifulSoup - how to get items from a website that don't contain a div like the other items



I am trying to scrape job ads from a website: https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging

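I would like to get all the visible data: the job title, the position, and the short description, such as: Full Stack; Big Data, DBA; Data Science, AI, ML and Embedded; Test, QA. The scraping part so far is: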

import bs4
import requests

result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
jobs = soup.find_all('td', class_="offerslistRow")
for job in jobs:
    description = job.find_all('div', class_="card__subtitle mdc-typography mdc-typography--body2")

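To be precise, I need element [0] of that result, because there are two kinds of short description with the same class name, but that is not the problem.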

Some ads have no short description, and they don't have the div mentioned above either (it is not empty, it simply does not exist).

Is there a way to get the description for such ads as well, as "N/A" for example, or something similar?
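The situation described above can be reproduced offline. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML` is made-up markup (not the real jobs.bg page), the class names are shortened, `describe` is a hypothetical helper, and `html.parser` is used instead of `lxml` so that nothing extra needs to be installed. It relies on the fact that `find()` returns `None` when no element matches.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption, not the real page):
# the second ad has no description div at all.
SAMPLE_HTML = """
<table>
  <tr><td class="offerslistRow">
    <a class="card__title">Python Developer</a>
    <div class="card__subtitle">Full Stack</div>
  </td></tr>
  <tr><td class="offerslistRow">
    <a class="card__title">Data Engineer</a>
  </td></tr>
</table>
"""


def describe(job):
    """Return the ad's short description, or "N/A" when the div is absent."""
    div = job.find("div", class_="card__subtitle")
    # find() returns None when nothing matches, so check before reading text
    return div.get_text(strip=True) if div is not None else "N/A"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
descriptions = [describe(job) for job in soup.find_all("td", class_="offerslistRow")]
print(descriptions)
```

Because every ad produces exactly one entry, the list of descriptions stays aligned with the list of ads even when a div is missing.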

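The question is a bit unclear, so I am assuming you want to collect all the job details. I also made a few other changes to your code and handled the cases where the location, the domain, or the skills are missing.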

The following code should do the job:

import bs4
import requests

result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
# find all jobs
jobs = soup.find_all('td', class_="offerslistRow")
# list to store job title
job_title = []
# list to store job location
job_location = []
# list to store domain and skills
domain_and_skills = []
# loop through the jobs
for job in jobs:
    # this check is to remove the other two blocks aligned to the right
    if job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow") is not None:

        # find and append job name
        job_name = job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow")
        job_title.append(job_name.text)

        # find and append location and salary description
        location_salary_desc = job.find('span', class_='card__subtitle mdc-typography mdc-typography--body2 top-margin')
        if location_salary_desc is not None:
            job_location.append(location_salary_desc.text.strip())
        else:
            job_location.append('NA')

        # find other two descriptions (Skills and domains)
        description = job.find_all(class_="card__subtitle mdc-typography mdc-typography--body2")
        # if both are empty (len=0)
        if len(description) == 0:
            domain_and_skills.append('NA')

        # if len=1 (can either be skills or domain details)
        elif len(description) == 1:

            # to check if domain is present and skills is empty
            if description[0].find('div') is None:
                domain_and_skills.append(description[0].text.strip())

            # domain is empty and skills is present
            else:
                # list to store skills
                skills = []
                # find all images in skills section and get alt attribute which contains skill name
                images = description[0].find_all('img')
                # if no image and only text is present (for example Shell Scripts is not an image, contains text value)
                if len(images) == 0:
                    skills.append(description[0].text.strip())

                # both image and text is present
                else:
                    # for each image, append skill name in list
                    for image in images:
                        skills.append(image['alt'])

                    # append text to list if not empty
                    if description[0].text.strip() != '':
                        skills.append(description[0].text.strip())

                # convert list to string
                skills_string = ','.join([str(skill) for skill in skills])
                domain_and_skills.append(skills_string)

        # both domain and skills are present
        else:
            domain_string = description[0].text.strip()
            # similar procedure as above to print skill names
            skills = []
            images = description[1].find_all('img')
            if len(images) == 0:
                skills.append(description[1].text.strip())
            else:
                for image in images:
                    skills.append(image['alt'])
                if description[1].text.strip() != '':
                    skills.append(description[1].text.strip())
            skills_string = ','.join([str(skill) for skill in skills])
            # combine domain and skills
            domain_string = domain_string + ',' + skills_string
            domain_and_skills.append(domain_string)
for i in range(0, len(job_title)):
    print(job_title[i])
    print(job_location[i])
    print(domain_and_skills[i])
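The skills handling above appears twice, so it could be folded into small helpers. The following is only a sketch under stated assumptions, not a drop-in replacement: `SAMPLE_HTML` is made-up markup loosely shaped like the class names used above, the title class is shortened, the names `text_or_na`, `block_to_text` and `parse_jobs` are hypothetical, and `html.parser` is used so the example runs without `lxml`. It also shows why the lookup for the two description blocks does not pick up the location span: when `class_` is given a string containing spaces, BeautifulSoup compares it against the exact value of the class attribute, so the extra `top-margin` class prevents a match.

```python
from bs4 import BeautifulSoup

# Exact class attribute value shared by the domain and skills blocks.
SUBTITLE = "card__subtitle mdc-typography mdc-typography--body2"

# Made-up markup (an assumption, not the real page).
SAMPLE_HTML = """
<table><tr>
<td class="offerslistRow">
  <a class="card__title">Python Developer</a>
  <span class="card__subtitle mdc-typography mdc-typography--body2 top-margin">Sofia</span>
  <div class="card__subtitle mdc-typography mdc-typography--body2">Full Stack</div>
  <div class="card__subtitle mdc-typography mdc-typography--body2">
    <div><img alt="Python"><img alt="Django">Shell Scripts</div>
  </div>
</td>
<td class="offerslistRow">
  <a class="card__title">Data Engineer</a>
</td>
</tr></table>
"""


def text_or_na(tag):
    """Stripped text of a tag, or 'NA' when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else "NA"


def block_to_text(block):
    """Image alt texts first, then any plain text, joined with commas."""
    parts = [img["alt"] for img in block.find_all("img") if img.has_attr("alt")]
    text = block.get_text(strip=True)
    if text:
        parts.append(text)
    return ",".join(parts)


def parse_jobs(html):
    """Return one dict per ad; missing pieces become 'NA'."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for job in soup.find_all("td", class_="offerslistRow"):
        title = job.find("a", class_="card__title")
        if title is None:  # skip cells that are not job ads
            continue
        # Exact-string match: the span with the extra top-margin class is excluded.
        blocks = job.find_all(class_=SUBTITLE)
        details = ",".join(filter(None, (block_to_text(b) for b in blocks)))
        records.append({
            "title": title.get_text(strip=True),
            "location": text_or_na(job.find("span", class_=SUBTITLE + " top-margin")),
            "details": details or "NA",
        })
    return records


records = parse_jobs(SAMPLE_HTML)
for record in records:
    print(record)
```

Keeping one dict per ad, rather than three parallel lists, means a missing field cannot shift the other values out of alignment.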
