Crawling text from website with Python Requests and BeautifulSoup fails

Tags: python, beautifulsoup, python-requests, web-crawler
Updated: 2023-09-21

I want to read some job ads automatically. To do this, I implemented the following function, which works for most web pages:

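import requests
from bs4 import BeautifulSoup

def getTextFromWeb(url):
    # Download the page and parse its HTML
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    # Collect every text node in the document
    temp = soup.findAll(text=True)
    xvec = []
    for x in temp:
        # Skip empty and single-character nodes
        if len(x) > 1:
            xvec.append(x)
    # Join the remaining nodes, one per line
    text = '\n'.join(xvec)
    return text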

However, I cannot read the relevant text from web pages that contain JavaScript. Any ideas on how to improve the function above? Many thanks!

The data sits in a <script> tag in the page's HTML source. You need to parse the JSON content from there:

from bs4 import BeautifulSoup
import requests
import json

url = 'https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# The job data is embedded as JSON in a <script type="application/ld+json"> tag
script = soup.find('script', {'type': 'application/ld+json'})
jsonData = json.loads(script.text)
print(jsonData['description'])
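
The snippet above assumes the page has exactly one such script tag holding a single JSON object. A slightly more defensive variant might look like the sketch below. The function name extract_descriptions and the sample_html string are hypothetical and only serve the illustration; the sketch also assumes the description may contain HTML markup and strips it. It works on an HTML string, so it can be tried without a live request.

```python
import json
from bs4 import BeautifulSoup


def extract_descriptions(html):
    """Return the plain-text 'description' of every JSON-LD block in html."""
    soup = BeautifulSoup(html, 'html.parser')
    descriptions = []
    for script in soup.find_all('script', {'type': 'application/ld+json'}):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue  # skip blocks that are empty or not valid JSON
        # A block may hold a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and 'description' in item:
                # Assumption: the description may contain HTML markup; strip it
                inner = BeautifulSoup(item['description'], 'html.parser')
                descriptions.append(inner.get_text(' ', strip=True))
    return descriptions


# Hypothetical sample page, used here instead of a live request
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "JobPosting", "title": "Regional Manager",
 "description": "<p>Lead the <b>regional</b> team.</p>"}
</script>
</head><body><p>Loading...</p></body></html>
"""

print(extract_descriptions(sample_html))  # ['Lead the regional team.']
```

For a live page, pass requests.get(url).text to the function. An empty list means the page has no usable JSON-LD block, in which case the content is probably loaded by JavaScript after the page is delivered and a different approach is needed.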
