Perl: extract the sentences between </SPAN> and <br>



I want to extract the sentences that sit between the SPAN and br tags. I am trying to use HTML::TreeBuilder. I am new to Perl. Any help would be appreciated.

<p>
<SPAN class="verse" id="1">1 </SPAN> ଆରମ୍ଭରେ ପରମେଶ୍ବର ଆକାଶ ଓ   ପୃଥିବୀକୁ ସୃଷ୍ଟି କଲେ।
<br><SPAN class="verse" id="2">2 </SPAN> ପୃଥିବୀ ସେତବେେଳେ ସଂପୂରନ୍ଭାବେ ଶୂନ୍ଯ ଓ କିଛି ନଥିଲା। ଜଳଭାଗ ଉପରେ ଅନ୍ଧକାର ଘାଡ଼ଇେେ ରଖିଥିଲା ଏବଂ ପରମେଶ୍ବରଙ୍କର ଆତ୍ମା ଜଳଭାଗ
<br><SPAN class="verse" id="3">3 </SPAN> ଉପରେ ବ୍ଯାପ୍ତ ଥିଲା।
<br><SPAN class="verse" id="4">4 </SPAN> ପରମେଶ୍ବର ଆଲୋକକୁ ଦେଖିଲେ ଏବଂ ସେ ଜାଣିଲେ, ତାହା ଉତ୍ତମ, ଏହାପ ରେ ପରମେଶ୍ବର ଆଲୋକକୁ ଅନ୍ଧକାରରୁ ଅଲଗା କଲେ।
</p>

What I have tried:

foreach my $line (@lines)
{
    # Now create a new tree to parse the HTML from the string $line
    my $tr = HTML::TreeBuilder->new_from_content($line);
    # And now find all <p> tags and create an array with the values.
    my @lists =
        map { $_->content_list }
        $tr->find_by_tag_name('p');
    # And loop through the array returning our values.
    foreach my $val (@lists) {
        print $val, "\n";
        printf FILE1 "\n%s", $val;
    }
}

I cannot skip the HTML tags nested inside the p tag. I only want to extract the Unicode text and skip the nested tags.

I would use XML::Twig, since I am familiar with it. Under the hood, it uses HTML::TreeBuilder to convert the HTML to XHTML.

A simple solution to your problem would be:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
binmode( STDOUT, ':utf8'); # to avoid warnings when printing out wide (multi-byte) characters

my $file= shift @ARGV;
my $t= XML::Twig->new->parsefile_html( $file);
foreach my $p ($t->descendants( 'p'))
  { $p->cut_children( 'span');              # HTML::TreeBuilder lowercases tags
    my @texts= $p->children_text( '#TEXT'); # just get the text
    print join "---\n", @texts;             # or do whatever with the text
  }

You can of course use a regexp :-)

# /i is needed because the sample markup uses upper-case <SPAN> tags
while ( $html =~ s!<span[^>]*>.*?</span>([^>]*)<br>!$1!i ){
    my $text = $1;
}
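If the goal is to collect the verses rather than rewrite $html, a global match in a loop may be simpler. This is only a sketch: it assumes the text after each closing span contains no other tags, and the sample markup and the @verses array are placeholders rather than anything from the question.

```perl
use strict;
use warnings;

# Placeholder markup with the same shape as the sample in the question.
my $html = <<'HTML';
<p>
<SPAN class="verse" id="1">1 </SPAN> first verse
<br><SPAN class="verse" id="2">2 </SPAN> second verse
<br><SPAN class="verse" id="3">3 </SPAN> third verse
</p>
HTML

my @verses;
# Capture everything from a closing span up to the next tag (<br> or </p>).
while ( $html =~ m{</span>([^<]*)}gi ) {
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and newlines
    push @verses, $text if length $text;
}

print "$_\n" for @verses;
```

Because the match stops at the next tag of any kind, the last verse is picked up even though it is followed by </p> instead of <br>.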

It is still easy to fix the original code with a regexp.

# And loop through the array returning our values.
foreach my $val (@lists) {
    $val =~ s!<[^>]*>!!gis;
    print $val, "\n";
    printf FILE1 "\n%s", $val;
}
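Another way to stay with HTML::TreeBuilder: content_list returns a mix of plain strings (text nodes) and HTML::Element objects (child tags), so testing each item with ref is enough to skip the nested span and br elements. A minimal sketch, assuming the HTML::TreeBuilder module is installed; the sample markup and variable names are placeholders:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Placeholder markup with the same shape as the sample in the question.
my $html = <<'HTML';
<p>
<SPAN class="verse" id="1">1 </SPAN> first verse
<br><SPAN class="verse" id="2">2 </SPAN> second verse
</p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @verses;
for my $p ( $tree->find_by_tag_name('p') ) {
    for my $node ( $p->content_list ) {
        next if ref $node;          # an HTML::Element (span, br): skip it
        my $text = $node;           # a plain string: this is the verse text
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @verses, $text if length $text;
    }
}
$tree->delete;                      # release the parse tree

print "$_\n" for @verses;
```

For real input containing non-ASCII text, the HTML would presumably need to be decoded on the way in and an encoding layer set on the output handle, as the binmode call in the XML::Twig answer does.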

Regexps are not evil: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Regular expressions are like a particularly spicy hot sauce: to be used in moderation, and only when appropriate.
