Parsing an XML RSS feed with regex


<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>

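From the content above, I want to extract the title, link, and description. How do I write a regular expression that captures them?

The end goal is to dump the extracted content into a predefined SQL database I have already created.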

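As suggested in the comments, you should most likely use an XML parser rather than regex. That said, since the format of an RSS feed is likely to be consistent and quite simple, a regex solution can also work.

For the current example, you can use: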

<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:\]\]>)?\s*<\/\1>

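Explanation:

  • <(.+)> - matches the opening tag and captures its name
  • \s* - matches optional whitespace characters (the newlines in the example)
  • (?:<!\[CDATA\[)? - non-capturing group for <![CDATA[, matched 0 or 1 times
  • \s* - matches optional whitespace characters
  • (.*) - capturing group that captures any characters (except newlines, so the content must sit on a single line)
  • \s* - matches optional whitespace characters
  • (?:\]\]>)? - non-capturing group for ]]> (the CDATA close), matched 0 or 1 times
  • \s* - matches optional whitespace characters
  • <\/\1> - matches the closing tag with the same name as the opening tag (a backreference to the first capturing group)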

let input = `<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>`;
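// Group 1 is the tag name, group 2 is the tag's text content.
let regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:\]\]>)?\s*<\/\1>/g;
let result;
// With the g flag, each exec() call resumes after the previous match
// and returns null once there are no more matches.
do {
  result = regex.exec(input);
  if (result) {
    console.log(result[1] + ": " + result[2]);
  }
} while (result);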
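
Since the goal is to store the values in a database, it may help to collect the matches into a single record object instead of logging them. The following is only a minimal sketch under a few assumptions: `extractFields` and the `wanted` list are hypothetical names introduced here, the sample input is made up, and each tag's content is assumed to fit on one line, as in the example above.

```javascript
// Sketch: gather the matched tags into one object keyed by tag name.
// `extractFields` and `wanted` are illustrative names, not part of any library.
function extractFields(xml, wanted = ["title", "link", "description"]) {
  // Same pattern as above: group 1 is the tag name, group 2 is its text.
  const regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:\]\]>)?\s*<\/\1>/g;
  const record = {};
  for (const match of xml.matchAll(regex)) {
    const [, tag, text] = match;
    // Keep only the fields we care about; ignore any other tags.
    if (wanted.includes(tag)) {
      record[tag] = text.trim();
    }
  }
  return record;
}

// Made-up sample input with the same shape as the feed item above.
const sample = `<title>
Some headline
</title>
<link>
https://example.com/story.html
</link>
<description>
<![CDATA[
A short summary.
]]>
</description>`;

const record = extractFields(sample);
console.log(record);
```

The resulting object can then be handed to whatever database driver you use, ideally as bound parameters rather than by concatenating the values into the SQL string. Note that entities such as &apos; are left undecoded by this approach, which is one more reason a real XML parser is usually the safer choice.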
