How to get the HTML page output from the Abot C# web crawler



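I am trying to build a web crawler in C# using Abot. I have searched through many examples and added the Abot web crawler to my project, but so far I can only get log output from it, not the HTML of the crawled pages. All I want is the HTML page output, because that HTML is the input for Html Agility Pack. Please help me get the HTML output from the Abot web crawler in C#. Thanks.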

This is explained on the quick start page:

//Create an instance of the crawler and subscribe to the PageCrawlCompleted event
PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
//The event handler method
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    //crawledPage.Content.Text //raw html
    //crawledPage.HtmlDocument //lazy loaded html agility pack object (HtmlAgilityPack.HtmlDocument)
    //crawledPage.CSDocument   //lazy loaded cs query object (CsQuery.Cq)
}
//Reduced to the part that matters for the raw HTML:
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    string html = crawledPage.Content.Text; //raw HTML of the crawled page
}

To get the HTML of the page, just use:

crawledPage.Content.Text

inside the function

`static void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)`

For example:

static void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);
    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);
    var htmlAgilityPackDocument = crawledPage.HtmlDocument; //Html Agility Pack parser
    var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
    //get content: Content.Text holds the raw HTML string
    Console.WriteLine(crawledPage.Content.Text);
}
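
Since the question mentions passing the HTML on to Html Agility Pack, here is a minimal sketch of that step, assuming the HtmlAgilityPack NuGet package is referenced. `HtmlLinkExtractor` and `ExtractLinks` are hypothetical names made up for illustration; they are not part of Abot or Html Agility Pack. The helper only takes the raw HTML string, so it works with whatever `crawledPage.Content.Text` returns.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper (not part of Abot): parses a raw HTML string
// with Html Agility Pack and collects the href of every anchor tag.
static class HtmlLinkExtractor
{
    public static List<string> ExtractLinks(string rawHtml)
    {
        var links = new List<string>();
        if (string.IsNullOrEmpty(rawHtml))
            return links;

        var document = new HtmlDocument();
        document.LoadHtml(rawHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", string.Empty));

        return links;
    }
}
```

Inside the `PageCrawlCompleted` handler this could be called as `HtmlLinkExtractor.ExtractLinks(crawledPage.Content.Text)`. If you only need the parsed document, `crawledPage.HtmlDocument` already gives you an Html Agility Pack object without the extra parsing step.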
