How to scrape data after logging in



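I am trying to extract posts from a forum called "Positive Wellbeing During Isolation" on HealthUnlocked.com. I can extract posts without logging in, but I cannot extract the posts that are only visible when logged in. I used "url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)" to extract the posts, but I don't know how to combine this with a login, because that URL returns JSON. I would appreciate any help.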

import requests, json
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep

# Attempt to log in by posting the credentials
url = "https://healthunlocked.com/private/programs/subscribed?user-id=1290456"
payload = {
    "username": "my username goes here",
    "Password": "my password goes here",
}
s = requests.Session()
p = s.post(url, data=payload)
headers = {"user-agent": "Mozilla/5.0"}
pages = 2
data = []
listtitles = []
listpost = []
listreplies = []
listpostID = []
listauthorID = []
listauthorName = []
for page in range(1, pages):
    # Fetch one page of posts as JSON
    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url, headers=headers)
    posts = json.loads(r.text)
    for post in posts:
        sleep(3.5)
        listtitles.append(post['title'])
        listreplies.append(post["totalResponses"])
        listpostID.append(post["postId"])
        listauthorID.append(post["author"]["userId"])
        listauthorName.append(post["author"]["username"])

        # Fetch the HTML page of the post and extract the body text
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        listpost.append(soup.select_one('div.post-body').get_text('|', strip=True))

## save to CSV
df = pd.DataFrame(list(zip(*[listpostID, listtitles, listpost, listreplies, listauthorID, listauthorName]))).add_prefix('Col')
df.to_csv('out1.csv', index=False)
print(df)
sleep(2)

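For most websites, you first have to obtain a token by logging in. Most of the time, this is a cookie. You then send that cookie along with every request that requires authorization. Open the Network tab in your browser's developer tools and log in with your username and password. You will be able to see how the login request is formatted and where it is sent. From there, try to replicate that request in your code.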
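As a rough illustration of that idea, here is a minimal sketch using a single requests.Session, which keeps any cookies set by the login response and sends them on later requests. The login URL and the form field names below are placeholders, not the site's real ones; replace them with whatever the Network tab shows. Some sites also expect a JSON body or a CSRF token, which this sketch does not handle.

```python
# Assumptions (not taken from the site): LOGIN_URL and the form field
# names are hypothetical. Copy the real values from the Network tab.
LOGIN_URL = "https://example.com/login"  # hypothetical placeholder
POSTS_URL = "https://solaris.healthunlocked.com/posts/positivewellbeing/popular"
HEADERS = {"user-agent": "Mozilla/5.0"}


def log_in(session, username, password):
    """POST the credentials; the session stores any cookies the server sets."""
    payload = {"username": username, "password": password}  # field names assumed
    response = session.post(LOGIN_URL, data=payload, headers=HEADERS)
    response.raise_for_status()
    return response


def fetch_posts(session, page):
    """GET one page of posts through the same session so its cookies are sent."""
    response = session.get(POSTS_URL, params={"pageNumber": page}, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def scrape(username, password, pages=1):
    """Log in once, then collect posts from pages 1..pages."""
    import requests  # third-party; only needed when making real requests

    with requests.Session() as session:
        log_in(session, username, password)
        posts = []
        for page in range(1, pages + 1):
            posts.extend(fetch_posts(session, page))
        return posts
```

The important change compared with calling requests.get directly is that every request goes through the same session object, so whatever cookie the login produced travels with the later requests. Calling scrape("my username", "my password", pages=1) would then return the list of post dictionaries, provided the placeholders have been replaced with the real login details.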
