Search for a string within a string (find all hrefs in HTML source code)

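I have a string variable that contains the entire HTML of a web page. The page contains links to other websites. I want to build a list of all the hrefs (similar to a web crawler). What is the best way to do this? Would any extension methods help? What about using Regex?

Thanks in advance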

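Use a DOM parser such as the HTML Agility Pack to parse the document and find all the links.

There is a good question on SO about how to use the HTML Agility Pack. Here is a simple example to get you started: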

// Requires: using System.Linq;
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Walk every node, keep the <a> elements that carry an href, and project the attribute value.
var links = doc.DocumentNode.DescendantNodes()
   .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
   .Select(n => n.Attributes["href"].Value);

I think you'll find this answers your question to a T:

http://msdn.microsoft.com/en-us/library/t9e807fx.aspx

:)

I would go with Regex.

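        // Matches href="..." or href='...' and captures the URL in group 1.
        Regex exp = new Regex(
            @"href\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);
        string InputText = "your HTML here"; // supply the page's HTML
        MatchCollection MatchList = exp.Matches(InputText);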

Try this Regex (it should work):

var matches = Regex.Matches (html, @"href=""(.+?)""");

You can then iterate over the matches and extract the captured URL from each one.
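Iterating over the matches could look roughly like the sketch below. The class and method names (HrefExtractor, ExtractHrefs) are made up for illustration and are not part of the original answer; the pattern is the one shown above, so it only picks up double-quoted, lowercase href attributes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper (an assumption, not from the original answer):
    // collects the text captured by group 1 of every href="..." match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the text between the quotes.
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Calling HrefExtractor.ExtractHrefs(html) with the page's HTML string would return the captured URLs in document order. Single-quoted or unquoted attributes are missed by this pattern, which is one reason the parser-based answers may be more robust.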

Have you looked into using the HTML Agility Pack? http://htmlagilitypack.codeplex.com/

With it, you can simply use XPath to get all the links on the page and put them into a list.

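private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();
    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }
    return hrefTags;
}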

Taken from another post: "Get all links on html page?"
