How can I scrape Instagram with Python and Selenium after Instagram changed its API flow? I can't find all the entries, only 12

I'm trying to scrape Instagram using Python and Selenium. The goal is to get the URL, number of comments, number of likes, etc. for every post.

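I was able to scrape some data, but for some reason the page never gives me more than the 12 most recent entries. I can't figure out how to get the rest to show. I even tried scrolling down and then reading the page, but it still only gives 12. I checked the page source and couldn't find where the remaining entries come from. It looks like these 12 entries are embedded in a script tag, and I don't see them anywhere else.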

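import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.instagram.com/fazeapparel/?hl=en')
source = driver.page_source
soup = bs(source, 'html.parser')
body = soup.find('body')
# The first 12 posts are embedded as JSON in a <script> tag; skip tags that have no text
script = body.find('script', text=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)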

With the retrieved data, I was able to find the information and collect it.

import time

links = []
posts = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']
for each in posts:
    node = each['node']
    link = 'https://www.instagram.com' + '/p/' + node['shortcode'] + '/'
    # Strip newlines from the caption
    posttext = node['edge_media_to_caption']['edges'][0]['node']['text'].replace('\n', '')
    comments = node['edge_media_to_comment']['count']
    likes = node['edge_liked_by']['count']
    postimage = node['thumbnail_src']
    isvideo = node['is_video']
    postdate = time.strftime('%Y %b %d %H:%M:%S', time.localtime(node['taken_at_timestamp']))
    links.append([link, posttext, comments, likes, postimage, isvideo, postdate])
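
One thing to watch in the loop above: it indexes the first caption edge directly, so a post with no caption would raise an IndexError. A minimal sketch of a more tolerant version follows. The function name parse_post and the sample node are hypothetical, and the key names are simply the ones used in the loop, so this only works as long as the embedded JSON keeps that shape.

```python
import time


def parse_post(node):
    """Flatten one timeline node into a dict; tolerate posts with no caption."""
    caption_edges = node.get('edge_media_to_caption', {}).get('edges', [])
    caption = caption_edges[0]['node']['text'] if caption_edges else ''
    return {
        'link': 'https://www.instagram.com/p/' + node['shortcode'] + '/',
        'text': caption.replace('\n', ' '),
        'comments': node['edge_media_to_comment']['count'],
        'likes': node['edge_liked_by']['count'],
        'image': node['thumbnail_src'],
        'is_video': node['is_video'],
        'date': time.strftime('%Y %b %d %H:%M:%S',
                              time.localtime(node['taken_at_timestamp'])),
    }


# Hypothetical node with an empty caption list, shaped like the entries above
sample = {
    'shortcode': 'abc123',
    'edge_media_to_caption': {'edges': []},
    'edge_media_to_comment': {'count': 3},
    'edge_liked_by': {'count': 40},
    'thumbnail_src': 'https://example.com/thumb.jpg',
    'is_video': False,
    'taken_at_timestamp': 1546300800,
}
row = parse_post(sample)
```

Returning a dict instead of a positional list is a design choice for readability only; the same fields could be appended to links in the original order.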

I even wrote a scroll function to scroll the window and then scrape the data, but it still returns only 12.

Is there any way I can get more than 12 entries? This account has 46 entries, and I can't find them in my code. Please help!

Edit: I think the data is embedded in React, so it doesn't show all the posts.

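Have you added a using directive for OpenQA.Selenium.Support.UI? It has a WebDriverWait that lets you wait until an element is visible. Sorry for doing this in C#. boxes should return all the posts.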

Again, I know it's not in Python, but I hope it helps.

// options (ChromeOptions) and time (a TimeSpan timeout) are defined elsewhere
IWebDriver driver = new ChromeDriver(@"C:\Users\admin\downloads", options);
WebDriverWait wait = new WebDriverWait(driver, time);
driver.Navigate().GoToUrl("https://www.instagram.com/cnn");
IWebElement mainDocument = wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementExists(By.TagName("body")));
IWebElement element = mainDocument.FindElement(By.CssSelector("#react-root > section > main > div > div._2z6nI > article > div > div"));
IList<IWebElement> boxes = element.FindElements(By.TagName("div"));
foreach (var post in boxes)
{
    // do stuff here
}

Edit:

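It makes an ajax call on the back end to load the next posts as you scroll. One approach might be to run a script that scrolls down. You would probably want to call this script from Selenium. I would add some logic to wait while the script runs and check whether it returns "STOP". Any kind of thread sleep will block the thread. I would use a timer to invoke the method that runs the script.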

function scrollDown() {
    // once this bottom element disappears we have found all the posts
    var bottom = document.querySelector('._4emnV');
    if (bottom != null) {
        window.scroll(0, 999999);
    } else {
        return "STOP";
    }
}
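
Since the question is in Python, here is a rough sketch of how the scroll-and-check idea might be driven from Selenium's Python bindings. It is a sketch under assumptions, not a tested scraper: collect_post_links, SCROLL_JS, pause and max_rounds are hypothetical names, the 'a[href*="/p/"]' selector is a guess at how posts are linked, and '._4emnV' is the class taken from the function above, which may change at any time. It gathers links on every pass in case the page only keeps part of the list in the DOM. For simplicity it uses a short time.sleep between passes, which does block the calling thread; a timer-based variant would replace that line.

```python
import time

# JavaScript run in the page on every pass. It returns the post links currently
# in the DOM plus a flag saying whether the bottom marker has disappeared.
# Both selectors are assumptions and may not match the live page.
SCROLL_JS = """
const links = Array.from(document.querySelectorAll('a[href*="/p/"]'), a => a.href);
const bottom = document.querySelector('._4emnV');
if (bottom !== null) { window.scroll(0, 999999); }
return {links: links, done: bottom === null};
"""


def collect_post_links(driver, pause=1.0, max_rounds=200):
    """Scroll until the bottom marker disappears, accumulating links as we go."""
    seen = {}  # dict used as an insertion-ordered set
    for _ in range(max_rounds):
        result = driver.execute_script(SCROLL_JS)
        for href in result['links']:
            seen.setdefault(href, None)
        if result['done']:
            break
        time.sleep(pause)  # give the ajax call time to load the next batch
    return list(seen)
```

With a real session this would be called as collect_post_links(driver) after driver.get(...) has loaded the profile page. The max_rounds cap is there so the loop still ends if the marker element never disappears.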
