How to handle pagination when scraping



A website I am scraping for educational purposes uses pagination.

My code scrapes the first page just fine...

But how do I go about scraping

?page=2
?page=3
?page=4
?page=5

and every page after that?

Note that I have searched for a solution but cannot seem to find a definitive answer.

Current code:

// @nuget: HtmlAgilityPack
using System;
using System.Data;
using System.Data.SqlClient;
using System.Net;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Allow TLS 1.0 through 1.2. SSL 3.0 is deprecated and insecure, so it is left out.
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
            | SecurityProtocolType.Tls11
            | SecurityProtocolType.Tls12;

        HtmlWeb web = new HtmlWeb();
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
        // var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        var divNodes = html.DocumentNode.SelectNodes(@"//div[@itemprop='reviewBody']");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (divNodes != null)
        {
            foreach (var tag in divNodes)
            {
                string review = tag.InnerText;

                // Put each question heading on its own line.
                review = review.Replace("What do you like best?", "What do you like best?\n");
                review = review.Replace("What do you dislike?", "\nWhat do you dislike?\n");
                review = review.Replace("Recommendations to others considering the product", "\n\nRecommendations to others considering the product\n");
                review = review.Replace("What business problems are you solving with the product?  What benefits have you realized?", "\n\nWhat business problems are you solving with the product?  What benefits have you realized?\n");
                Console.WriteLine(review);
                Console.WriteLine("\n------------------------------- Review found. Adding to Database -------------------------------\n");

                // Prepare the text for storage: drop apostrophes and convert newlines to HTML breaks.
                review = review.Replace("'", "");
                review = review.Replace("\n", "<br />");
            }
        }
    }
}

The link to the next page can be found at:

//link[@rel='next']

Just keep following it until it no longer exists.
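As a rough sketch of that loop (not tested against the site itself), something like the following could work. It assumes every page exposes a `<link rel="next" href="...">` element as described above. `PaginatedScraper`, `FollowNextLinks`, `PrintAllReviews`, and the `maxPages` safety cap are hypothetical names introduced here for illustration; only the HtmlAgilityPack calls come from the library.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PaginatedScraper
{
    // Hypothetical helper: yields the page at startUrl, then each page reached by
    // following <link rel="next"> until a page has no such link.
    // maxPages is an assumed safety cap; visited guards against link cycles.
    public static IEnumerable<HtmlDocument> FollowNextLinks(
        string startUrl, Func<string, HtmlDocument> loadPage, int maxPages = 100)
    {
        var visited = new HashSet<string>();
        string url = startUrl;

        while (url != null && visited.Count < maxPages && visited.Add(url))
        {
            HtmlDocument page = loadPage(url);
            yield return page;

            HtmlNode next = page.DocumentNode.SelectSingleNode("//link[@rel='next']");
            string href = next == null ? "" : next.GetAttributeValue("href", "");

            // Stop when there is no next link; otherwise resolve it against the current URL.
            url = string.IsNullOrEmpty(href)
                ? null
                : new Uri(new Uri(url), HtmlEntity.DeEntitize(href)).AbsoluteUri;
        }
    }

    // Hypothetical usage: print every review body on every page.
    public static void PrintAllReviews(string startUrl)
    {
        var web = new HtmlWeb();
        foreach (HtmlDocument page in FollowNextLinks(startUrl, u => web.Load(u)))
        {
            var reviews = page.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");
            if (reviews == null) continue; // SelectNodes returns null when nothing matches

            foreach (HtmlNode review in reviews)
                Console.WriteLine(review.InnerText);
        }
    }
}
```

Calling `PaginatedScraper.PrintAllReviews("https://www.g2crowd.com/products/google-analytics/reviews")` from `Main` would then walk the pages in order. Passing the loader in as a delegate keeps the loop independent of `HtmlWeb`, so it can be tried against in-memory HTML first.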
