I'm using the following code to strip HTML elements from the txt files in a directory:
use strict;
use warnings;
use File::Spec;
use HTML::FormatText;
use Cwd;
my $direct = "/directory/";
opendir my $dh, $direct or die "Can't open directory";
# Note: the implicit assignment to $_ in while (readdir ...) needs Perl 5.12+
while ( readdir $dh ) {
    next if /^\./;    # skip ".", ".." and other dot files
    my $file = File::Spec->catfile($direct, $_);
    print $file."\n";
    my $outfile = File::Spec->catfile($direct, "out_$_");
    next unless -f $file;
    # Slurp the whole file into $html
    my $html = do {
        open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
        local $/;
        <$fh>;
    };
    next unless $html =~ /<html/i;
    my $formatted = HTML::FormatText->format_string(
        $html, leftmargin => 0, rightmargin => 60);
    open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);
    print $fh "File: $file\n\n";
    print $fh "$formatted\n";
    print $fh "*" x 40, "\n" ;
    close $fh or die qq(Unable to close "$outfile" after writing: $!);
    unlink $file or warn "Could not unlink $file: $!";
}
But the resulting output still seems to contain a lot of unwanted markup:
<div style="text-align:center;"><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">TEXT TEXT TEXT TEXT</font></div><div style="text-align:center;"><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">TEXT TEXT TEXT TEXT</font></div><div style="text-align:center;">&#160;</div><p style='margin-top:0pt; margin-bottom:0pt'><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">1</font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;">. </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">ORGANIZATION </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">AND</font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"> SUMMARY OF </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">SIGNIFICANT ACCOUNTING </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-
Any idea how to get rid of this HTML/CSS (while keeping the text inside those tags)?
The HTML::Parser distribution includes an example program that extracts the plain text from an HTML file.
#!/usr/bin/perl -w
# Extract all plain text from an HTML file
use strict;
use HTML::Parser 3.00 ();
my %inside;
# Track how deeply we are nested inside each tag
sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " "; # not for all tags
}
# Print text unless it belongs to a <script> or <style> element
sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}
HTML::Parser->new(api_version => 3,
                  handlers => [start => [\&tag, "tagname, '+1'"],
                               end => [\&tag, "tagname, '-1'"],
                               text => [\&text, "dtext"],
                              ],
                  marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
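To use this from the directory loop in the question, the text has to come back as a string instead of being printed. Here is a minimal sketch of that, assuming the same HTML::Parser handlers as the example above. The helper name html_to_text and the whitespace cleanup at the end are my own additions for illustration, not part of the HTML::Parser distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper (not part of HTML::Parser): return the plain text of
# an HTML string, skipping the contents of <script> and <style> elements.
sub html_to_text {
    my ($html) = @_;
    my %inside;      # nesting depth per tag name
    my $out = '';    # collected text

    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [ sub { $inside{ $_[0] }++; $out .= ' ' }, 'tagname' ],
            end   => [ sub { $inside{ $_[0] }--; $out .= ' ' }, 'tagname' ],
            text  => [
                sub { $out .= $_[0] unless $inside{script} || $inside{style} },
                'dtext',    # text with entities decoded
            ],
        ],
        marked_sections => 1,
    );
    $parser->parse($html);
    $parser->eof;

    # Assumption: collapsing whitespace is acceptable for the output
    $out =~ s/\s+/ /g;
    $out =~ s/^\s+|\s+$//g;
    return $out;
}

# Print the plain text of every file named on the command line
for my $file (@ARGV) {
    open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
    my $html = do { local $/; <$fh> };
    print html_to_text($html), "\n";
}
```

In the question's loop, the call to HTML::FormatText->format_string could then be swapped for html_to_text($html), if losing the paragraph layout is acceptable.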
If you have Mojolicious installed, then something like:
perl -MMojo::DOM -0 -e 'print my $dom = Mojo::DOM->new(<>)->all_text()' file.html
might work :-)
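Explanation: Mojo::DOM->new(<>)->all_text() should be fairly self-explanatory ;-) ... Mojo::DOM->new(<>) just builds a DOM object from whatever is read via <>, and ->all_text() runs the all_text method on that object.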
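As for the -0 switch, see perlrun. Essentially it is there to slurp the file, so that <> returns the entire content (well... someone will correct me in the comments). Strictly speaking, -0 with no digits sets the input record separator to the null character, so this works for text files that contain no NUL bytes; -0777 is the switch perlrun documents for slurping whole files. You could also use Mojo::DOM to write a real script, more like Dave's answer, rather than just a one-liner as in my example.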
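For what it's worth, such a script might look roughly like the sketch below. It is only a sketch: the helper name dom_to_text is made up, the UTF-8 input encoding is an assumption about your files, and removing script and style elements before calling all_text is my own addition, not something the one-liner does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: plain text of an HTML string via Mojo::DOM
sub dom_to_text {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # Drop elements whose content is code or CSS rather than readable text
    $dom->find('script, style')->each( sub { $_->remove } );

    my $text = $dom->all_text;
    $text =~ s/\s+/ /g;          # collapse whitespace (older Mojolicious
    $text =~ s/^\s+|\s+$//g;     # versions already trim, newer ones do not)
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';

# Print the plain text of every file named on the command line
for my $file (@ARGV) {
    # Assumption: the input files are UTF-8 encoded
    open my $fh, '<:encoding(UTF-8)', $file
        or die qq(Unable to open "$file" for reading: $!);
    my $html = do { local $/; <$fh> };
    print dom_to_text($html), "\n";
}
```

Run it as perl script.pl file.html, where script.pl is whatever name you save it under.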