How to scrape comment-section content that appears in neither the HTML nor the AJAX responses



As the title says, I want to scrape the comments in the "Project Activity" section of this page: https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false

What I don't understand, however, is that the comment text shows up neither in the plain HTML nor in the responses to the XHR calls.

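This is where my knowledge ends. Apart from the two approaches above I don't know what else to try, and I have no idea where this text actually comes from or how I could scrape it. Could someone enlighten me?

Thanks a lot!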

You can use this script to load the comments from an external URL:

import re
import json
import requests

url = 'https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false'
comments_url = 'https://cdn.donorschoose.net/dwr/jsonp/ProposalMessageWebService/getProposalMessagesByProposalId?callback=projectTimelineCallback&param0={id}&context=false'

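# Extract the numeric project id from the page URL.
id_ = re.search(r'/(\d+)/', url).group(1)
text = requests.get(comments_url.format(id=id_)).text
# Strip the JSONP wrapper: keep what is between the outer parentheses.
text = re.search(r'\((.*)\)', text).group(1)
# Replace JavaScript `new Date(123)` with the bare number so the text is valid JSON.
data = json.loads(re.sub(r'new Date\((\d+)\)', r'\1', text))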
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
# print some info to screen:
for t in data['data']['threads']:
    print(t['original']['author']['firstName'])
    print(t['original']['message'])
    print('-' * 80)

Prints:

Stephanie
purchased the <a href="#materials"><span>resources</span></a> for Ms. Carway's classroom and notified the school principal of delivery
--------------------------------------------------------------------------------
Maree
<a href="#letter"><img alt="Teacher Mail" src="https://cdn.donorschoose.net/images/project/posted_mail.gif"><span>Thank You Letter</span></a> posted!
--------------------------------------------------------------------------------
Maree
<strong class='good-news'>Good news: Project fully funded!</strong>
--------------------------------------------------------------------------------
...and so on.
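
The unwrapping step can also be tried without touching the network. The following is a minimal sketch, assuming the response is a JSONP string of the form callback({...}) that may contain JavaScript `new Date(...)` values. The helper name `unwrap_jsonp`, the sample payload, and its `date` field are made up for illustration and are not taken from the real response.

```python
import json
import re


def unwrap_jsonp(payload: str):
    """Parse a JSONP string such as callback({...}) into a Python object."""
    # Greedy match: everything between the first "(" and the last ")".
    match = re.search(r'\((.*)\)', payload, flags=re.DOTALL)
    if match is None:
        raise ValueError('payload does not look like JSONP')
    body = match.group(1)
    # "new Date(123)" is JavaScript, not JSON, so keep only the number.
    body = re.sub(r'new Date\((\d+)\)', r'\1', body)
    return json.loads(body)


# Hypothetical sample payload (not real data from the site).
sample = (
    'projectTimelineCallback({"data": {"threads": [{"original": '
    '{"message": "hi", "date": new Date(1600000000000)}}]}});'
)
result = unwrap_jsonp(sample)
print(result['data']['threads'][0]['original'])
```

Note that the date substitution is applied to the whole payload, so a message that literally contained the text `new Date(123)` would be altered as well. For a quick scrape that is probably acceptable.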
