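I am splitting sentences on single space characters and then matching the resulting terms against the keys of a hash. I only get a match when a term is 100% identical to a key, and I am struggling to find a regex that matches the different occurrences of the same word. For example, say I have the term "antagon": it matches the word "antagon" exactly, but fails to match antagonist, antagonistic, pre-antagonistic, hydro-antagonist and so on. I also need a regex that matches occurrences of words like MCF-7 with MCF7 or MC-F7, silencing the effect of special characters.
This is the code I have so far; the commented-out part is where I am struggling.
(Note: the terms in the hash are stemmed to the root form of the word.)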
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
    chomp $_;
    my $flag = 0;
    my $line = lc $_;
    my @full = ();
    if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
        my $string = $1;
        chomp $string;
        $string =~ s/,/ , /g;
        $string =~ s/\./ . /g;
        $string =~ s/;/ ; /g;
        $string =~ s/\(/ ( /g;
        $string =~ s/\)/ )/g;
        $string =~ s/:/ : /g;
        $string =~ s/::/ :: )/g;
        my @array = split / /, $string;
        foreach my $word (@array) {
            chomp $word;
            if ( $word =~ /,|;|\.|\(|\)/g ) {
                push( @full, $word );
            }
            if ( $Stop_words{$word} ) {
                push( @full, $word );
            }
            if ( $Values{$word} ) {
                my $term = "<Drug>$word</Drug>";
                push( @full, $term );
            }
            else {
                push( @full, $word );
            }
            # if($word=~/.*\Q$Values{$word}\E/i)#Changed this
            # {
            #     $term="<Drug>$word</$Drug>";
            #     print $term,"\n";
            #     push(@full,$term);
            # }
        }
    }
    my $mod_str = join( " ", @full );
    print OUT $mod_str, "\n";
}
"I need a regex to match words like MCF-7 with MCF7 or MC-F7"
The most straightforward approach is to strip out the hyphens, i.e.
my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
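For this to help with the hash lookup, the keys have to be normalized the same way as the words. The following is a minimal sketch of that idea, not the asker's actual setup: `normalize_term` is a hypothetical helper, and the contents of `%Values` are made-up sample data standing in for whatever the Drug module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a term and drop the characters to ignore.
sub normalize_term {
    my ($term) = @_;
    my $norm = lc $term;
    $norm =~ s/[-_']//g;    # same character class as $ignore_these above
    return $norm;
}

# Assumed sample data; the real %Values comes from the asker's own module.
my %Values;
$Values{ normalize_term($_) } = 1 for ( 'MCF-7', 'antagon' );

# Normalize each word the same way before looking it up.
for my $word ( 'MCF7', 'MC-F7', 'mcf-7', 'MCF-8' ) {
    my $hit = $Values{ normalize_term($word) } ? 'match' : 'no match';
    print "$word: $hit\n";
}
```

Normalizing both sides keeps the lookup an exact hash access, so no regex is needed for the MCF-7 case.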
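I'm not sure what is stored in your %Values hash, so it's hard to tell what you expect to happen with
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagine you want is (simplifying your code somewhat)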
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.010;
use Data::Dumper;
while (<>) {
    chomp $_;
    my $flag = 0;
    my $line = lc $_;
    my @full = ();
    if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
        my $string = $1;
        chomp $string;
        $string =~ s/([,.;():])/ $1 /g;    # squished these together
        $string =~ s/::/ :: )/g;           # typo in original
        my @array = split /\s+/, $string;  # split on one /or more/ spaces
        foreach my $word (@array) {
            chomp $word;
            my $term = $word;
            # backslashes are doubled because this is a double-quoted string
            my $word_chars = "[\\w\\-_']";
            my $word_part  = "antagon";
            # optional prefix, the stem, then at least one more word character
            if ( $word =~ m{$word_chars*?$word_part$word_chars+} ) {
                $term = "<Drug>$word</Drug>";
            }
            push( @full, $term );    # push
        }
    }
    my $mod_str = join( " ", @full );
    say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you are expecting:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
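The script above hard-codes the stem "antagon". If the stems live in the keys of a hash, as the question describes, one pattern can be built from all of them. This is only a sketch under assumptions: the sample keys in `%Values` are invented, `tag_word` is a hypothetical helper, and the pattern uses `*` on both sides of the stem so that the bare stem also matches, which differs from the `+` used above.

```perl
use strict;
use warnings;

# Assumed sample stems; in the question these would be the keys of %Values.
my %Values = ( antagon => 1, inhibit => 1 );

# Build one alternation from the keys, longest first, quoting metacharacters.
my $stems = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %Values;

my $word_chars = qr/[\w\-']/;
my $drug_re    = qr/^$word_chars*(?:$stems)$word_chars*$/i;

# Hypothetical helper: wrap a word in <Drug> tags if it contains any stem.
sub tag_word {
    my ($word) = @_;
    return $word =~ $drug_re ? "<Drug>$word</Drug>" : $word;
}

print join( ' ', map { tag_word($_) } qw(the pre-antagonistic inhibitor binds) ), "\n";
```

Compiling the alternation once with `qr//` avoids rebuilding the pattern for every word in the sentence.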
perl -ne '$things{$1}++ while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//; END { print "$_\n" for sort keys %things }' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
this returns:
alpha-antagonist
antagonist
antagonize
antagonizing
Here is a regular (non-one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>) {
    # collect every word containing "antagon", removing it from the line as we go
    $things{$1}++ while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//;
}
print $out "$_\n" for sort keys %things;
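You may want to take another look at the assumptions behind your approach. To me it sounds like you are looking for words that are within a certain distance of the words in a list. Take a look at the Levenshtein distance formula and see whether that is what you want. Be aware, though, that a naive recursive implementation can take exponential time; the usual dynamic-programming version takes time proportional to the product of the two word lengths for each pair compared.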
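To make the Levenshtein suggestion concrete, here is a sketch of the standard dynamic-programming computation in Perl. It is an illustration rather than anything taken from the question; the `levenshtein` function name and the sample words are assumptions, and `List::Util` is a core module.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings, computed row by row
# (only the previous and current rows are kept in memory).
sub levenshtein {
    my ( $s, $t ) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Distance from the empty prefix of $s to each prefix of $t.
    my @prev = ( 0 .. scalar(@t) );

    for my $i ( 1 .. scalar(@s) ) {
        my @curr = ($i);
        for my $j ( 1 .. scalar(@t) ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,             # deletion
                $curr[ $j - 1 ] + 1,       # insertion
                $prev[ $j - 1 ] + $cost,   # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Sample pairs (made up for illustration).
for my $pair ( [ 'mcf7', 'mcf-7' ], [ 'antagonist', 'antagonize' ] ) {
    my ( $x, $y ) = @$pair;
    print "$x / $y: ", levenshtein( $x, $y ), "\n";
}
```

A word could then be treated as a match when its distance to some key is below a chosen threshold, although comparing every word against every key gets slow for large hashes.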