<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump's desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>
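For the content above, I want to extract the title, link, and description. How can I write a regular expression to capture these?
The end goal is to dump the extracted content into a predefined SQL database that I have created.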
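As suggested in the comments, you should most likely use an XML parser rather than a regex. However, since the format of an RSS feed is likely to be consistent and is quite simple, a regex solution can also work.
For the current example, you can use: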
<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>
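Explanation:
<(.+)> - matches the opening tag and captures its name
\s* - matches optional whitespace (the newlines in the example)
(?:<!\[CDATA\[)? - non-capturing group for <![CDATA[, matched 0 or 1 times
\s* - matches optional whitespace
(.*) - capturing group that captures any characters on the line (. does not match line breaks)
\s* - matches optional whitespace
(?:]]>)? - non-capturing group for ]]> (the CDATA close), matched 0 or 1 times
\s* - matches optional whitespace
<\/\1> - matches a closing tag with the same name as the opening tag (a backreference to the first capturing group)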
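To see how the optional CDATA groups and the backreference behave, a small check along these lines may help. It is only a sketch: the helper name `matchElement` and the short sample strings are assumptions made for illustration and are not part of the feed.

```javascript
// Same pattern as above, without the g flag so every call starts at index 0.
const elementPattern = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/;

// Hypothetical helper: returns [tagName, content], or null when nothing matches.
function matchElement(xml) {
  const m = elementPattern.exec(xml);
  return m ? [m[1], m[2]] : null;
}

// Plain element: the optional CDATA groups simply match zero times.
const plain = matchElement("<title>\nHello\n</title>");

// CDATA element: the wrapper is consumed by the non-capturing groups,
// so only the inner text ends up in the second capture group.
const wrapped = matchElement("<description>\n<![CDATA[\nSome text\n]]>\n</description>");

// Mismatched closing tag: the \1 backreference prevents a match.
const mismatched = matchElement("<title>\nHello\n</link>");

console.log(plain);
console.log(wrapped);
console.log(mismatched);
```

The third call is the one that shows why the backreference is there: without `\1`, any closing tag would be accepted.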
let input = `<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump's desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>`;
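// The g flag makes each exec() call continue after the previous match.
let regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;
let result;
do {
  // exec() returns null once no further match is found, which ends the loop.
  result = regex.exec(input);
  if (result) {
    // result[1] is the tag name, result[2] is the text content.
    console.log(result[1] + ": " + result[2]);
  }
} while (result);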
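Since the stated goal is to load the extracted fields into a SQL database, the step between the regex and the database could look roughly like the sketch below. Everything database-related here is an assumption: the table name `articles`, its three columns, and the `?` placeholder style (used by SQLite and MySQL drivers, while other drivers use a different syntax). The helpers `extractFields` and `toInsertStatement` are hypothetical names, and actually executing the statement depends on whichever driver you use, so that part is left out.

```javascript
const fieldPattern = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;

// Hypothetical helper: collects the wanted fields of one item into an object.
function extractFields(itemXml) {
  const wanted = new Set(["title", "link", "description"]);
  const fields = {};
  // matchAll() iterates over every match of a global regex.
  for (const match of itemXml.matchAll(fieldPattern)) {
    const [, tag, text] = match;
    if (wanted.has(tag)) {
      fields[tag] = text.trim();
    }
  }
  return fields;
}

// Hypothetical helper: the table and column names are assumptions.
// Values are passed as parameters instead of being concatenated into the SQL.
function toInsertStatement(fields) {
  return {
    sql: "INSERT INTO articles (title, link, description) VALUES (?, ?, ?)",
    params: [fields.title ?? null, fields.link ?? null, fields.description ?? null],
  };
}

const sample = `<title>
Example headline
</title>
<link>
https://example.com/story.html
</link>
<description>
<![CDATA[
Example summary.
]]>
</description>`;

const statement = toInsertStatement(extractFields(sample));
console.log(statement.sql);
console.log(statement.params);
```

Keeping the values in a separate `params` array matters here because headlines often contain apostrophes (as in the example title), which would break a hand-concatenated SQL string.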