Problem extracting data from Indeed with BeautifulSoup



I am trying to extract the job description of every posting from the Indeed website, but the result is not what I expected!

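I wrote some code to get the job descriptions. I am using Python 2.7 and the latest BeautifulSoup. When you open the page and click a job title, the related information appears on the right side of the screen. I need to extract that job description for every job on this page. My code: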

import sys
import urllib2 
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N

I expected to see the descriptions, but instead I got [] as the result. Is it because the id is not unique? If so, how should I edit the code?

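The #vjs-desc element is generated by JavaScript, and its content comes from an Ajax request. It is not present in the HTML that urllib2 downloads, which is why findAll returns []. To get the descriptions, you need to perform that Ajax request yourself.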
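A quick way to confirm this diagnosis is to check whether the id occurs anywhere in the raw page source. The following is a minimal sketch, not part of the original answer: the helper `id_in_static_html` and both HTML strings are hypothetical stand-ins, since the real page content is not reproduced here.

```python
import re

def id_in_static_html(html, element_id):
    """Return True if an element with the given id attribute appears in the raw HTML text."""
    pattern = r"""id\s*=\s*["']%s["']""" % re.escape(element_id)
    return re.search(pattern, html) is not None

# Hypothetical stand-in for the page source as served, before any JavaScript runs.
static_html = '<div id="vjs-container"></div><script>/* fills the pane later */</script>'
# Hypothetical stand-in for the DOM after the script has run in a browser.
rendered_html = '<div id="vjs-container"><div id="vjs-desc">Job description</div></div>'

print(id_in_static_html(static_html, "vjs-desc"))    # the element is absent from the served HTML
print(id_in_static_html(rendered_html, "vjs-desc"))  # the element exists only after rendering
```

If the check fails on the downloaded HTML, no parser will find the element there, and the data has to be fetched from the request the page itself makes, as in the code that follows.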

# -*- coding: utf-8 -*-
# requests makes it easier to create HTTP requests and sessions
import requests
import re, urllib
from bs4 import BeautifulSoup  # the 'html.parser' argument used below is bs4 syntax
url = "https://www......"
# create session
s = requests.session()
html = s.get(url).text
# extract job IDs (the brackets must be escaped to match them literally)
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json 
ajax_content = s.get(ajax_url).json()
print(ajax_content)
for job_id, desc in ajax_content.items():
    print job_id
    soup = BeautifulSoup(desc, 'html.parser')
    # or try this
    # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
    print soup.text.encode('utf-8')
    print('==============================')
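The key-extraction and URL-building steps can be tried offline before hitting the site. The sketch below runs on Python 3 and falls back to the Python 2 import. The helper names `extract_job_keys` and `build_jobdescs_url` and the sample HTML are hypothetical; the markup only assumes that the page embeds keys in the `jobKeysWithInfo['...']` form the regex above targets, and the second key is made up.

```python
import re

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def extract_job_keys(html):
    """Collect every key that appears as jobKeysWithInfo['<key>'] in the page source."""
    # The square brackets are escaped so they match literally instead of opening a character class.
    return re.findall(r"jobKeysWithInfo\['(.+?)'\]", html)

def build_jobdescs_url(job_keys):
    """Build the Ajax URL used in the answer, with the keys comma-joined and percent-encoded."""
    return "https://www.indeed.com/rpc/jobdescs?jks=" + quote(",".join(job_keys))

# Hypothetical page fragment; the real page source may differ.
sample_html = """
<script>
jobKeysWithInfo['8000b2656aae5c08'] = true;
jobKeysWithInfo['1a2b3c4d5e6f7a8b'] = true;
</script>
"""

keys = extract_job_keys(sample_html)
print(keys)
print(build_jobdescs_url(keys))
```

If `extract_job_keys` returns an empty list on the real page, the site has probably changed how it embeds the keys, and the regex needs to be adapted to the current page source.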
