我有一些工作代码,我尝试使用dreamincode教程进行多线程处理:http://www.dreamincode.net/forums/topic/255487-multithreading-in-perl/
那里的示例代码似乎工作正常,但我一生都无法弄清楚为什么我的代码不能。从放入调试消息开始,它似乎一直到所有线程的子例程的末尾,然后在那里坐了一段时间,然后遇到分段错误并转储核心。话虽如此,我也没有设法在任何地方找到核心转储文件(Ubuntu 13.10)。
如果有人有任何建议阅读,或者可以在下面相当混乱的代码中看到错误,我将永远感激不尽。
#!/usr/bin/env perl
use Email::Valid;
use LWP::Simple;
use XML::LibXML;
use Text::Trim;
use threads;
use DB_File;
use Getopt::Long;
my $sourcefile = "thislevel.csv";
my $startOffset = 0;
my $chunk = 10000;
my $num_threads = 8;
$result = GetOptions ("start=i" => $startOffset, # numeric
"chunk=i" => $chunk, # numeric
"file=s" => $sourcefile, # string
"threads=i" => $num_threads, #numeric
"verbose" => $verbose); # flag
$tie = tie(@filedata, "DB_File", $sourcefile, O_RDWR, 0666, $DB_RECNO)
or die "Cannot open file $sourcefile: $!n";
my $filenumlines = $tie->length;
if ($filenumlines>$startOffset + $chunk){
$numlines = $startOffset + $chunk;
} else {
$numlines = $filenumlines;
}
open (emails, '>>emails.csv');
open (errorfile, '>>errors.csv');
open (nxtlvl, '>>nextlevel.csv');
open (donefile, '>>donelines.csv');
my $line = '';
my $found = false;
my $linenum=0;
my @threads = initThreads();
foreach(@threads){
$_ = threads->create(&do_search);
}
foreach(@threads){
$_->join();
}
close nxtlvl;
close emails;
close errorfile;
close donefile;
sub initThreads{
# An array to place our threads in
my @initThreads;
for(my $i = 1;$i<=$num_threads;$i++){
push(@initThreads,$i);
}
return @initThreads;
}
sub do_search{
my $id = threads->tid();
my $linenum=$startOffset-1+$id;
my $parser = XML::LibXML->new();
$parser->set_options({ recover => 2,
validation => 0,
suppress_errors => 1,
suppress_warnings => 1,
pedantic_parser => 0,
load_ext_dtd => 0, });
while ($linenum < $numlines) {
$found = false;
@full_line = split ',', $filedata[$linenum-1];
$line = trim(@full_line[1]);
$this_url = trim(@full_line[2]);
print "Thread $id Scanning $linenum of $filenumlines: ";
printf "%.3f%n", 100 * $linenum / $filenumlines;
my $content = get trim($this_url);
if (!defined($content)) {
print errorfile "$this_url, no contentn";
}elsif (length($content)<100) {
print errorfile "$this_url, shortn";
}else {
my $doc = $parser->load_html(string => $content);
if(defined($doc)){
for my $anchor ( $doc->findnodes("//a[@href]") )
{
$is_email = substr $anchor->getAttribute("href") ,7;
if(Email::Valid->address($is_email)) {
printf emails "%s, %sn", $line, $is_email;
$found = true;
} else{
$link = $anchor->getAttribute("href");
if (substr lc(trim($link)),0,4 eq "http"){
printf nxtlvl "%s, %sn", $line, $link;
} else {
printf nxtlvl "%s, %s/%sn", $line, $line, $link;
}
}
}
}
if ($found=false){
my @lines = split 'n',$content;
foreach my $cline (@lines){
my @words = split ' ',$cline;
foreach my $word (@words) {
my @subwords = split '"',$word ;
foreach my $subword (@subwords) {
if(Email::Valid->address($subword)) {
printf emails "%s, %sn", $line, $subword;
}
}
}
}
}
}
printf donefile "%sn",$linenum;
$linenum = $linenum + $num_threads;
}
threads->exit();
}
乱的编码错误意味着我的代码永远不应该被用作其他访问者的示例之外,DB_File
不是一个线程安全的模块。
令人讨厌的是,也许是误导性的,它绝对可以正常工作,直到您关闭在整个代码中成功访问该文件的线程。