gregexpr function in R returns different results depending on whether perl is TRUE or FALSE

I have the following piece of HTML, and I am trying to run a regex against it with the gregexpr function in R.

<div class=g-unit>
<div class=nwp style=display:inline>
<input type=hidden name=cid value="22144">
<input autocomplete=off class=id-fromdate type=text size=10 name=startdate value="Sep 6, 2013"> -
<input autocomplete=off class=id-todate type=text size=10 name=enddate value="Sep 5, 2014">
<input id=hfs type=submit value=Update style="height:1.9em; margin:0 0 0 0.3em;">
</div>
</div>
</div>
<div id=prices class="gf-table-wrapper sfe-break-bottom-16">
<table class="gf-table historical_price">
<tr class=bb>
<th class="bb lm lft">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
...
...
</table>
</div>

I am trying to extract the table portion of this HTML with the following regular expression:

<table\s+class="gf-table historical_price">.+<

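When I run gregexpr with perl=FALSE, it works fine and I get a result. However, if I run it with perl=TRUE, I get nothing back; it appears to find no match.

Does anyone know why the results differ depending on whether perl is turned on or off? Many thanks!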

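In the extended regex mode (perl = FALSE), the dot appears to be able to match newlines, whereas in perl mode it does not. To make the pattern work in perl mode, you need the (?s) modifier so that the dot matches newlines as well: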

> m <- gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', str, perl = TRUE)

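In many regex flavors, the dot does not match newlines by default, probably to make line-by-line processing more convenient.

The s in the inline modifier (?s) stands for "single line". In other words, the whole string is treated as one line (as far as the dot is concerned), even if it contains newlines.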
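The difference can be seen on a much smaller input. The following is a minimal sketch, not taken from the original thread: the string x and the pattern pat are made up for illustration, and the only thing that varies is the regex engine and the (?s) modifier.

```r
# Hypothetical two-line input; the newline sits between "first" and "second".
x <- "first line\nsecond line"
pat <- "first.+second"

# Default engine (perl = FALSE): the dot is allowed to cross the newline.
tre <- grepl(pat, x)

# PCRE (perl = TRUE): the dot stops at the newline, so there is no match.
pcre <- grepl(pat, x, perl = TRUE)

# PCRE with the inline (?s) modifier: the dot matches the newline again.
pcre_dotall <- grepl(paste0("(?s)", pat), x, perl = TRUE)

print(c(tre = tre, pcre = pcre, pcre_dotall = pcre_dotall))
```

Only the middle call should report no match, which mirrors the behaviour described in the question.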

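You need to use the inline (?s) modifier to force the dot to match all characters, including newlines.

The perl=T argument switches to the PCRE library, which implements Perl-style regular expression pattern matching.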

gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', x, perl=T)
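gregexpr only returns match positions and lengths, so the matched text still has to be pulled out, for example with regmatches. The sketch below assumes a shortened stand-in for the HTML from the question (the variable x and its contents are illustrative, not the full page):

```r
# Shortened, hypothetical stand-in for the HTML in the question.
x <- paste(
  "<div id=prices>",
  "<table class=\"gf-table historical_price\">",
  "<tr class=bb>",
  "<th class=\"bb lm lft\">Date",
  "</table>",
  "</div>",
  sep = "\n"
)

# Note the doubled backslash: "\\s" in an R string is the regex token \s.
pattern <- "(?s)<table\\s+class=\"gf-table historical_price\">.+</table>"

# gregexpr gives positions; regmatches turns them into the matched substrings.
m <- gregexpr(pattern, x, perl = TRUE)
table_html <- regmatches(x, m)[[1]]

cat(table_html, "\n")
```

Because .+ is greedy, an input containing several tables would be matched up to the last </table>; a lazy .+? may be preferable in that case.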

However, as mentioned in the comments, using a parser is the recommended way to do this. I would start with the XML library.

library(XML)
cat(paste(xpathSApply(htmlParse(html), '//table[@class="gf-table historical_price"]', xmlValue), collapse = "\n"))
