How to Scrape Comment Section Content That Cannot Be Found in Either the HTML Content or the AJAX Response

As the title says, I want to scrape the comments in the "Project Activity" section of this page: https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false

However, what I don't understand is that the comment text appears neither in the plain HTML nor in the responses to the XHR calls.

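This is where my knowledge ends. Beyond those two techniques I don't know what else to try, and I can't tell where this text actually comes from or how I could scrape it. Could someone enlighten me?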

Thanks a lot!!

The comments are loaded from an external URL. You can use this script to fetch them:

import re
import json
import requests

url = 'https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false'
comments_url = 'https://cdn.donorschoose.net/dwr/jsonp/ProposalMessageWebService/getProposalMessagesByProposalId?callback=projectTimelineCallback&param0={id}&context=false'

# the numeric project id is the path segment between two slashes
id_ = re.search(r'/(\d+)/', url).group(1)
text = requests.get(comments_url.format(id=id_)).text
# the response is JSONP: keep only what is inside the callback's parentheses
text = re.search(r'\((.*)\)', text).group(1)
# replace JavaScript "new Date(123)" with the bare number so the text is valid JSON
data = json.loads( re.sub(r'new Date\((\d+)\)', r'\1', text) )
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
# print some info to screen:
for t in data['data']['threads']:
    print(t['original']['author']['firstName'])
    print(t['original']['message'])
    print('-' * 80)

Prints:

Stephanie
purchased the <a href="#materials"><span>resources</span></a> for Ms. Carway's classroom and notified the school principal of delivery
--------------------------------------------------------------------------------
Maree
<a href="#letter"><img alt="Teacher Mail" src="https://cdn.donorschoose.net/images/project/posted_mail.gif"><span>Thank You Letter</span></a> posted!
--------------------------------------------------------------------------------
Maree
<strong class='good-news'>Good news: Project fully funded!</strong>
--------------------------------------------------------------------------------
...and so on.
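
To see why the two regex steps are needed, here is a minimal offline sketch of the same unwrapping applied to a made-up payload. The sample string, the `date` field name, and the `parse_jsonp` helper are assumptions for illustration only. The real response from the endpoint is much larger and may be shaped differently.

```python
import json
import re

# Hypothetical, shortened payload. The "date" field name is an assumption;
# only the callback wrapper and the "new Date(...)" pattern mirror the script above.
SAMPLE = (
    'projectTimelineCallback({"data": {"threads": [{"original": '
    '{"author": {"firstName": "Stephanie"}, '
    '"message": "Good news: Project fully funded!", '
    '"date": new Date(1590000000000)}}]}})'
)


def parse_jsonp(payload: str) -> dict:
    """Strip the callback wrapper and JavaScript Date constructors, then parse as JSON."""
    # Greedy match: everything between the first "(" and the last ")".
    match = re.search(r'\((.*)\)', payload, flags=re.DOTALL)
    if match is None:
        raise ValueError('payload does not look like JSONP')
    body = match.group(1)
    # "new Date(123)" is JavaScript, not JSON; keep only the millisecond timestamp.
    body = re.sub(r'new Date\((\d+)\)', r'\1', body)
    return json.loads(body)


data = parse_jsonp(SAMPLE)
first = data['data']['threads'][0]['original']
print(first['author']['firstName'], first['date'])
```

The text is missing from the page's HTML because the browser requests it as a script (JSONP) rather than as plain JSON, and a script response has to be unwrapped before `json.loads` can read it.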
