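I am using BeautifulSoup to scrape job postings from a company website (I have permission). The code below runs and prints the URL of each job posting, but I want to add a condition that must be true before a URL is returned.
Current code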
import requests
from bs4 import BeautifulSoup
base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
links = soup.select("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res, 'html.parser')
        title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
        overview = soup.select_one("p.page-intro__longDescription").get_text()
        details = soup.select_one("div.rte").get_text()
        print(title, link, details)
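What I want to achieve
I want to run the code above, but only for links whose "Level" is Graduate, and I want to actually see the output. I wrote the following, but it does not work.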
level = soup.find_all('dd', {'class': 'author'})
if "Graduate" in text
The HTML I am scraping, from http://implementconsultinggroup.com/career/#/1143:
<a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
    <h2 class="article__title--tiny" data-searchable-text="">Management consultants within supply chain management</h2>
    <p class="article__longDescription" data-searchable-text="">COPENHAGEN • We are looking for bright graduates with a passion for supply chain management and supply chain planning for our planning and execution excellence team.</p>
    <div class="styled-link styled-icon">
        <span class="icon icon-icon">
            <i class="fa fa-chevron-right"></i>
        </span>
        <span class="icon-text">View Position</span>
    </div>
</a>
<div class="small-12 medium-3 columns top-lined">
    <dl>
        <dt>Position</dt>
        <dd class="author">Management Consultant</dd>
        <dt>Level</dt>
        <dd class="author">Graduate</dd>
        <dt>Expertise</dt>
        <dd class="author">Operations strategy, Supply chain management</dd>
        <dt>Location</dt>
        <dd class="author">Copenhagen</dd>
    </dl>
</div>
Ideal output
Ideally, I would be able to run the code I have written and have it filter out every position whose level is not Graduate.
Here you go:
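# Assumes the imports, `base` and `url` from the question's code.
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
# Each <li> in the listing holds one position: its link and its <dl> of details.
for li in soup.find('ul', class_='list-articles list').find_all('li'):
    # The second <dd class="author"> is the "Level" value in the HTML above.
    level = li.find_all('dd', {'class': 'author'})[1].get_text()
    if "Graduate" in level:
        links = li.select("a")
        for link in links:
            # Default to "" so anchors without an href do not raise a TypeError.
            href = link.get("href", "")
            if "career" in href and 'COPENHAGEN' in link.text:
                res = requests.get(base + href).text
                # Separate name so the listing soup is not overwritten.
                page = BeautifulSoup(res, 'html.parser')
                title = page.select_one("h1.page-intro__title").get_text() if page.select_one("h1.section__title") else ""
                overview = page.select_one("p.page-intro__longDescription").get_text()
                details = page.select_one("div.rte").get_text()
                print(title, link, details)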
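Your own attempt failed for a few reasons: `find_all` returns a list of tags rather than text, the name `text` was never defined, and the `if` line is missing its colon. Note also that indexing with `[1]` depends on "Level" always being the second `<dd>`. If that order might vary, one possible variation is to look the value up by its `<dt>` label instead. The sketch below is only an illustration and runs against an inline sample rather than the live site: `SAMPLE_HTML` is a trimmed copy of the markup from the question, the `<li>` wrapper is assumed from the loop above, the second item is made up for contrast, and `read_details` and `graduate_links` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the item from the question, wrapped in an assumed <li>,
# plus a made-up non-graduate item for contrast.
SAMPLE_HTML = """
<ul class="list-articles list">
  <li>
    <a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
      <p class="article__longDescription">COPENHAGEN • We are looking for bright graduates.</p>
    </a>
    <dl>
      <dt>Position</dt><dd class="author">Management Consultant</dd>
      <dt>Level</dt><dd class="author">Graduate</dd>
      <dt>Location</dt><dd class="author">Copenhagen</dd>
    </dl>
  </li>
  <li>
    <a href="/career/hypothetical-senior-role/" class="box-link">
      <p class="article__longDescription">COPENHAGEN • Made-up entry for contrast.</p>
    </a>
    <dl>
      <dt>Level</dt><dd class="author">Experienced</dd>
    </dl>
  </li>
</ul>
"""


def read_details(item):
    """Map each <dt> label in an item to the text of the <dd> that follows it."""
    details = {}
    for dt in item.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            details[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return details


def graduate_links(html, city="COPENHAGEN"):
    """Return the hrefs of career links whose item has Level equal to Graduate."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for item in soup.find_all("li"):
        # Skip any item whose Level is missing or is not Graduate.
        if read_details(item).get("Level") != "Graduate":
            continue
        # href=True skips anchors that have no href attribute at all.
        for link in item.find_all("a", href=True):
            if "career" in link["href"] and city in link.get_text():
                hrefs.append(link["href"])
    return hrefs


print(graduate_links(SAMPLE_HTML))
```

To use this against the real page, you would pass `requests.get(url).text` to `graduate_links` and then fetch `base + href` for each result, as in the loop above.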