Selecting an HTML text element using regex

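I want to find © in an HTML document and, basically, get the entity that the copyright belongs to.

The copyright line appears in a few different forms: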

<p class="bg-copy">&copy; 2011  The New York Times Company</p>

or

<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
&copy; 2011</a> 
<a href="http://www.nytco.com/">The New York Times Company</a>

or

<br>Published since 1996<br>Copyright &copy; CounterPunch<br>
All rights reserved.<br>

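I want to ignore the date and any tags in between, and get just "The New York Times Company" or "CounterPunch".

I haven't found much information about using regex with JavaScript or jQuery, although I have a feeling it could lead to serious headaches. If there is a better way, please let me know.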

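For a robust solution you will probably need a combination of DOM navigation and some heuristics. Your examples can be solved with a regex, but there are many other possible scenarios...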

&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)

This works for your three samples, but only for them and for similar cases.

See it on Rubular

Explanation:

&copy; // copyright symbol
[\s\d]* // followed by spaces or digits
(?:<\/.+?>[^>]*>)? // maybe followed by a closing tag and another opening one
([^<]*) // then match anything up to the next tag

For how to use a regex in JavaScript with jQuery, see this answer. Basically, you can use the match(/regex/) function:

var result = string.match(/&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
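
A minimal sketch of how the match call could be wrapped and run against the three samples from the question. It assumes the character class is meant as `[\s\d]` (whitespace or digits, as the explanation says) and that the slash of the closing tag is escaped inside the regex literal. The names `COPYRIGHT_RE`, `extractCopyrightHolder` and `samples` are made up for this illustration.

```javascript
// Group 1 captures the text that follows the symbol, the optional
// date, and an optional "</tag> ... <tag ...>" pair.
const COPYRIGHT_RE = /&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// Hypothetical helper: returns the captured holder, or null if nothing usable matched.
function extractCopyrightHolder(html) {
  const match = html.match(COPYRIGHT_RE);
  if (!match) return null;
  const holder = match[1].trim();
  return holder === "" ? null : holder;
}

// The three snippets from the question, as raw HTML strings.
const samples = [
  '<p class="bg-copy">&copy; 2011  The New York Times Company</p>',
  '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
    '&copy; 2011</a> \n' +
    '<a href="http://www.nytco.com/">The New York Times Company</a>',
  '<br>Published since 1996<br>Copyright &copy; CounterPunch<br>\n' +
    'All rights reserved.<br>',
];

const holders = samples.map(extractCopyrightHolder);
// holders: ["The New York Times Company", "The New York Times Company", "CounterPunch"]
```

Note that this works on the raw markup, where the symbol is still written as the `&copy;` entity. Text read back from the DOM usually contains the literal © character instead, so the pattern would need adjusting there.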

$('*:contains(©)').filter(function(){
    return $(this).find('*:contains(©)').length == 0
}).text();
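
The selector above returns the text of the innermost elements that contain ©, so the date and the word "Copyright" are still in the result. A rough sketch of one possible clean-up step follows. The helper name `cleanCopyrightText` and its stripping rules are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: reduce element text such as "© 2011  Some Company"
// to just the holder name. Returns null when nothing follows the symbol.
function cleanCopyrightText(text) {
  const idx = text.indexOf("©");
  if (idx === -1) return null;
  // Drop everything up to the symbol, then leading whitespace, digits, commas and hyphens.
  const rest = text.slice(idx + 1).replace(/^[\s\d,-]+/, "");
  // Keep only the first line and drop a trailing "All rights reserved".
  const firstLine = rest.split(/\r?\n/)[0];
  const holder = firstLine.replace(/\s*All rights reserved\.?\s*$/i, "").trim();
  return holder || null;
}

const fromParagraph = cleanCopyrightText("© 2011  The New York Times Company");
const fromLink = cleanCopyrightText("\n© 2011");
```

In the second sample from the question the symbol and the company name sit in different `<a>` elements, so the innermost element holds only "© 2011" and the helper returns null. That is the kind of case where DOM navigation to a sibling element would be needed.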

Test it here: http://jsfiddle.net/unloco/kGPYA/
