I want to read job postings automatically. To do that, I wrote the following function, which works for most web pages:
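import requests
from bs4 import BeautifulSoup

def getTextFromWeb(url):
    website = requests.get(url)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(website.content, 'html.parser')
    # Collect every text node in the document
    temp = soup.find_all(string=True)
    xvec = []
    for x in temp:
        # Skip empty and single-character fragments
        if len(x) > 1:
            xvec.append(x)
    # Join the fragments with newlines
    text = '\n'.join(xvec)
    return text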
However, I cannot read the relevant text from web pages that contain JavaScript. Any ideas on how to improve the function above? Many thanks!
The data is inside a <script> tag in the page's HTML source. You need to parse the JSON content from there:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The data sits in a <script type="application/ld+json"> tag (JSON-LD)
script = soup.find('script', {'type':'application/ld+json'})
# Parse the tag's contents as JSON and print the description field
jsonData = json.loads(script.text)
print(jsonData['description'])
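
The description field in such JSON-LD data often contains HTML markup itself, and some pages put a list of objects in the tag instead of a single one. Below is a minimal sketch of a helper that deals with both cases using only the standard library. The names description_from_jsonld and _TextCollector are made up for this illustration, and the "one object or a list of objects" layout is an assumption, not something verified against the page above.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only fragments that contain something besides whitespace
        if data.strip():
            self.parts.append(data.strip())


def description_from_jsonld(raw_json):
    """Return the first 'description' found in a JSON-LD string as plain text, or None."""
    data = json.loads(raw_json)
    # Assumption: the payload is either a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and 'description' in item:
            collector = _TextCollector()
            collector.feed(str(item['description']))
            collector.close()  # flush any text still buffered by the parser
            return ' '.join(collector.parts)
    return None
```

With the snippet above, it could be called as description_from_jsonld(script.text) in place of the last two lines. It returns None when no object carries a description, so the caller can fall back to plain text extraction.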