Beautiful Soup outputs <div class="page-content"> as <div class="page-content loading">, with no contents inside the div?



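I am trying to scrape some information from a website using BeautifulSoup, but the output differs from the page's HTML. The content I am trying to get from the page is inside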

<div class="page-content">

but in my BeautifulSoup object it shows up as:

<div class="page-content loading"></div>

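with nothing inside the div. I tried searching for what I wanted anyway, but it came up with nothing. I also tried the html5lib and lxml parsers, but that did not change the output. Is the browser running some kind of JavaScript code that prevents me from getting the full page HTML, or is something else going on? I am new to this, so any advice would be appreciated.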

Here is my script:

import requests
from bs4 import BeautifulSoup

URL = 'https://zone4.ca/race/2020-11-08/c91ec8f6/results'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all("div", class_="racer-row")
print(results)
print(soup)

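Yes, it is definitely loading the content through JavaScript queries. You can either copy the contents of those queries (headers, payload, ...) and send them manually with the requests library, or (better, IMO) use a browser automation driver such as Selenium to scrape the page as it normally renders.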
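A rough sketch of the Selenium route, under some assumptions: Chrome and a matching driver are installed, and the rendered page really uses the racer-row class that the question searches for. The helper names (ClassCounter, count_elements_with_class, fetch_rendered_html) are made up for illustration. The counter is a quick way to check whether a given HTML string contains the rows at all, so you can compare the raw response with the rendered DOM.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in html carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count


def fetch_rendered_html(url, class_name="racer-row", timeout=15):
    """Load url in a real browser and return the DOM after JavaScript has run."""
    # Imported lazily: Selenium and a browser driver are only needed here.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element exists (assumed class name).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, class_name))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by fetch_rendered_html(URL) can then be passed to BeautifulSoup exactly as page.content is in the question. If count_elements_with_class reports zero rows for the requests response but a positive number for the rendered HTML, the content is being filled in client-side.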

The data is loaded dynamically via JavaScript. But you can use this script to construct the Ajax request and parse some of the data:

import re
import json
import requests
from datetime import datetime, timezone

url = 'https://zone4.ca/race/2020-11-08/c91ec8f6/results/'
html_doc = requests.get(url).text

# grab the object literal passed to callback(...) and turn it into valid JSON
data = re.search(r'callback\((\{.*\})\)', html_doc, flags=re.S).group(1).replace("'", '"')
data = json.loads(re.sub(r'([^\s]+):', r'"\1":', data))

data_url = "https://zone4.ca/public/data/race.json?url={url}&page={page}&channel_id={channelID}&channel_class=StandardRace&entity_id={entityID}"
feed = requests.get(data_url.format(**data)).json()

# uncomment this to print all data:
# print(json.dumps(feed, indent=4))

for racer in feed['tree']['_child_racers']:
    print(racer['first_name'][0], racer['last_name'][0])
    for t in racer['_child_timedentitys']:
        for i in range(1, 12):
            time = t.get('time_{}_list'.format(i))
            if not time:
                continue
            # timestamps are in microseconds since the epoch
            dtobj = datetime.fromtimestamp(time[0][0] / 1_000_000, timezone.utc)
            print('\tLap {}: {}'.format(i, dtobj))

Prints:

Tim Shea
	Lap 1: 2020-11-08 14:40:54.611000+00:00
	Lap 2: 2020-11-08 14:45:17.259000+00:00
	Lap 3: 2020-11-08 14:49:48.259000+00:00
	Lap 4: 2020-11-08 14:54:18.778000+00:00
	Lap 5: 2020-11-08 14:58:52.099000+00:00
	Lap 6: 2020-11-08 15:03:17.700000+00:00
	Lap 7: 2020-11-08 15:07:44.818000+00:00
	Lap 8: 2020-11-08 15:12:18.896000+00:00
	Lap 9: 2020-11-08 15:16:52.010000+00:00
	Lap 10: 2020-11-08 15:21:18.897000+00:00
	Lap 11: 2020-11-08 15:25:55.058000+00:00
Zachary Steinman
	Lap 1: 2020-11-08 14:41:32.912000+00:00
	Lap 2: 2020-11-08 14:46:29.458000+00:00
	Lap 3: 2020-11-08 14:51:29.970000+00:00
	Lap 4: 2020-11-08 14:56:30.875000+00:00
	Lap 5: 2020-11-08 15:01:40.057000+00:00
	Lap 6: 2020-11-08 15:06:47.620000+00:00
	Lap 7: 2020-11-08 15:11:58.790000+00:00
	Lap 8: 2020-11-08 15:17:09.099000+00:00
	Lap 9: 2020-11-08 15:22:14.819000+00:00
	Lap 10: 2020-11-08 15:27:19.859000+00:00
Kent Williams
	Lap 1: 2020-11-08 14:42:40.399000+00:00
	Lap 2: 2020-11-08 14:48:33.714000+00:00
...and so on.
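
The least obvious part of the script is the pair of regular expressions that turn the JavaScript object literal passed to callback(...) into JSON. Here is a minimal offline sketch of that step. The sample_html string and its values are invented stand-ins, not the real page, and the function names are hypothetical.

```python
import json
import re

# Hypothetical stand-in for the inline script on the results page.
sample_html = """
<script>
callback({
    url: 'race/2020-11-08/c91ec8f6/results',
    page: 'results',
    channelID: 123,
    entityID: 456
})
</script>
"""

DATA_URL = (
    "https://zone4.ca/public/data/race.json?url={url}&page={page}"
    "&channel_id={channelID}&channel_class=StandardRace&entity_id={entityID}"
)


def extract_callback_args(html_doc):
    """Return the object literal passed to callback(...) as a dict."""
    match = re.search(r'callback\((\{.*\})\)', html_doc, flags=re.S)
    if match is None:
        raise ValueError("no callback({...}) call found")
    # JSON requires double-quoted strings.
    literal = match.group(1).replace("'", '"')
    # Quote bare keys: any run of non-whitespace characters followed by a colon.
    quoted = re.sub(r'([^\s]+):', r'"\1":', literal)
    return json.loads(quoted)


def build_data_url(params):
    """Fill the JSON endpoint template with the extracted parameters."""
    return DATA_URL.format(**params)


params = extract_callback_args(sample_html)
print(params)
print(build_data_url(params))
```

This key-quoting trick is fragile. It breaks if a value itself contains a colon (for example a full https:// URL) or if a key sits on the same line as the opening brace. It works only as long as the page keeps this simple one-key-per-line layout.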
