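Get all nodes and their content using HtmlDocument/HtmlAgilityPack

I need to get all the nodes from an HTML document, then get the text and child nodes of each of those nodes, and then the same again for those child nodes. For example, I have this HTML: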

<p>This <b>is a <a href="">Link</a></b> with <b>bold</b></p>

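So I need a way to get the p node, then the unformatted text (This), the bold-only text (is a), the bold link (Link), and the rest of the formatted and unformatted text.

I know that with HtmlDocument I can select all nodes and child nodes, but how do I get the text that comes before a child node, then the child node and its own text/child nodes, so that I can build a rendered version of the HTML ("This is a Link with bold")?

Note that the example above is a simple one. The real HTML will have more complex things such as lists, frames, numbered lists, triple-formatted text, and so on. Also note that the rendering itself is not the problem. I have already done that, but in another way. What I need is only the part that gets the nodes and their content. Also, I can't ignore any node, so I can't filter anything out. The main node could be a p, div, frame, ul, etc.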

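After looking at HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple way to interpret the HTML code.

My code is a bit too complex to post as an example, so I will post a trimmed-down version of it.

First, the HtmlDocument has to be loaded. This can be done in any function: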

HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);

Then we need to interpret each "main" node (p in this example) and, depending on its type, call the recursive function (InterNode) on its children:

HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;
foreach (HtmlNode node in nodes)
{
    if (node.Name.ToLower() == "p") // lower-case the name just in case
    {
        Paragraph newPPara = new Paragraph();
        foreach (HtmlNode childNode in node.ChildNodes)
        {
            InterNode(childNode, ref newPPara);
        }
        richTextBlock.Blocks.Add(newPPara);
    }
}

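Note that there is a property called "NodeType", but it does not give you the tag: it only reports the general kind of node (the HtmlNodeType enum: Document, Element, Comment or Text). So use the "Name" property instead (also note that the Name property of an HtmlNode is not the same thing as the name attribute in HTML).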
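To see what the two properties return, here is a minimal console sketch that prints both for each direct child of the p node. The class name NodeKindDemo and the helper Describe are made up for this illustration; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeKindDemo
{
    // Hypothetical helper: formats a node as "Name (NodeType)".
    public static string Describe(HtmlNode node)
    {
        return node.Name + " (" + node.NodeType + ")";
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>");

        // The p element is the first (and only) child of the document node.
        HtmlNode p = doc.DocumentNode.FirstChild;

        // Print the name and the general kind of every direct child of p.
        foreach (HtmlNode child in p.ChildNodes)
        {
            Console.WriteLine(Describe(child));
        }
    }
}
```

Text between tags shows up as its own child with the name "#text", which is what the InterNode function relies on.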

Finally, we have the InterNode function, which adds inlines to the Paragraph passed by reference (ref):

public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
    string htmlNodeName = htmlNode.Name.ToLower();

    // Collect the names of all ancestors: the text can sit inside more than
    // one formatting tag, e.g. b(old) and i(talic) at the same time.
    List<string> nodeAttList = new List<string>();
    HtmlNode parentNode = htmlNode.ParentNode;
    while (parentNode != null)
    {
        nodeAttList.Add(parentNode.Name.ToLower());
        parentNode = parentNode.ParentNode;
    }

    Run newRun = new Run();
    foreach (string nodeAttStr in nodeAttList) // apply every inherited format to the inline
    {
        switch (nodeAttStr)
        {
            case "b":
            case "strong":
                newRun.FontWeight = FontWeights.Bold;
                break;
            case "i":
            case "em":
                newRun.FontStyle = FontStyle.Italic;
                break;
        }
    }

    if (htmlNodeName == "#text") // #text means it is a text node, like <i><#text/></i>. Thanks @HungCao
    {
        newRun.Text = htmlNode.InnerText;
        originalPar.Inlines.Add(newRun); // add the run, otherwise it never reaches the paragraph
    }
    else // not a #text: don't read its InnerText, because it is another node and any text it has will be in a #text child
    {
        foreach (HtmlNode childNode in htmlNode.ChildNodes)
        {
            InterNode(childNode, ref originalPar);
        }
    }
    return true;
}
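The same traversal can be tried without any UI types. The following is only a sketch, not the code from my app: the class TextRunSketch and its Collect/Walk methods are hypothetical names, and instead of building Run objects it records each text node together with the chain of ancestor tags it sits in. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextRunSketch
{
    // Returns one entry per text node, in document order:
    // Key = the text, Value = its ancestor tags, outermost first (e.g. "p>b>a").
    public static List<KeyValuePair<string, string>> Collect(HtmlNode root)
    {
        var runs = new List<KeyValuePair<string, string>>();
        Walk(root, runs);
        return runs;
    }

    private static void Walk(HtmlNode node, List<KeyValuePair<string, string>> runs)
    {
        if (node.Name == "#text")
        {
            // Climb the parents to find every tag that applies to this text.
            var tags = new List<string>();
            for (HtmlNode parent = node.ParentNode; parent != null; parent = parent.ParentNode)
            {
                if (!parent.Name.StartsWith("#")) // skip the "#document" root
                {
                    tags.Add(parent.Name);
                }
            }
            tags.Reverse(); // outermost tag first
            runs.Add(new KeyValuePair<string, string>(node.InnerText, string.Join(">", tags)));
            return;
        }

        // Not a text node: its text, if any, lives in its children.
        foreach (HtmlNode child in node.ChildNodes)
        {
            Walk(child, runs);
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>");

        foreach (KeyValuePair<string, string> run in Collect(doc.DocumentNode))
        {
            Console.WriteLine("[" + run.Value + "] \"" + run.Key + "\"");
        }
    }
}
```

Because every text node carries its full ancestor chain, mapping "b", "strong", "i" or "em" to a font weight or style can be done afterwards in whatever rendering layer is used.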

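Note: I know I said my app needs to render the HTML in a different way than a WebView does, and I know this sample code produces the same thing a WebView would, but as I said before, this is just a trimmed-down version of my final code. My original/full code works the way I need; this is just the base.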
