Filtering strings out of Html Agility Pack results



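I fetch the HTML from a URL, select the table element, and then select all tr elements in that table whose id attribute contains "tr". That leaves me with about 20 elements like this: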

<th class="nw">1 Jan</th><td class="nw">Friday</td><td><a href="/holidays/andorra/new-year-day">New Year&#39;s Day</a></td><td>National holiday</td>

How can I get each piece of text from the element above separately?
Example output: 1 Jan/Friday/New Year's Day/National holiday

var url = "https://www.timeanddate.com/holidays/andorra/";
var client = new HttpClient();
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
var html = await client.GetStringAsync(url);
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var a1 = document.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("id", "").Equals("holidays-table"))
    .ToList();
var a2 = a1[0].Descendants("tr")
    .Where(node => node.GetAttributeValue("id", "").Contains("tr"))
    .ToList();

This should give you what you want:

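// Each inner list holds the cell texts of one table row.
List<List<string>> holidays = document
    .DocumentNode
    .SelectNodes("//table[@id='holidays-table']/tbody/tr")
    .Select(tr => tr.ChildNodes
        .Where(n => n.Name == "th" || n.Name == "td")
        // InnerText keeps entities such as &#39; so decode them before trimming
        .Select(n => HtmlAgilityPack.HtmlEntity.DeEntitize(n.InnerText).Trim())
        .ToList())
    .Where(row => row.Any())  // filter out empty rows
    .ToList();
foreach (var row in holidays)
{
    // Join with "/" to match the requested output format
    Console.WriteLine(string.Join("/", row));
}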

Demo here: https://dotnetfiddle.net/0SADls
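If you want to try the idea without hitting the network, here is a minimal sketch that parses the single row from the question. The inline HTML string (including the `tr1` id) is a made-up stand-in for the real page, and the looser `//tr` XPath is my assumption in case the served markup has no tbody. It also guards against `SelectNodes` returning null when nothing matches, which the snippet above does not do.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample: the row from the question wrapped in a minimal table.
        var html = "<table id='holidays-table'><tbody><tr id='tr1'>"
                 + "<th class=\"nw\">1 Jan</th><td class=\"nw\">Friday</td>"
                 + "<td><a href=\"/holidays/andorra/new-year-day\">New Year&#39;s Day</a></td>"
                 + "<td>National holiday</td></tr></tbody></table>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when no node matches.
        var rows = document.DocumentNode.SelectNodes("//table[@id='holidays-table']//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (var tr in rows)
        {
            // Keep only th/td children, in document order, and decode HTML entities.
            var cells = tr.ChildNodes
                .Where(n => n.Name == "th" || n.Name == "td")
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                .ToList();

            // Skip rows that contain no cells at all.
            if (cells.Count > 0)
            {
                Console.WriteLine(string.Join("/", cells));
            }
        }
    }
}
```

For the sample row this should produce a line in the same slash-separated shape as the example output in the question. Against the live page you would replace the inline string with the HTML returned by `GetStringAsync`.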
