Can't get weighted cosine similarity to work



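>I am trying to get the weighted cosine similarity of two documents. I am using Text::Document and Text::DocumentCollection. My code seems to work, but it doesn't return the number I expect.

Here is my code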

use strict;
use warnings;
use Text::Document;
use Text::DocumentCollection;
my $newfile  = shift @ARGV;
my $newfile2 = shift @ARGV;
##This is in another file.
my $t1 = countFreq($newfile);
my $t2 = countFreq($newfile2);
my $collection = Text::DocumentCollection->new(file => 'coll.db');
$collection->Add("One", $t1);
$collection->Add("Two", $t2);
my $wSim = $t1->WeightedCosineSimilarity( $t2,
    \&Text::DocumentCollection::IDF,
    $collection
);
print "\nWeighted Cosine Sim is: $wSim\n";

All this prints is Weighted Cosine Sim is: with nothing after the colon.

Here is the code for countFreq:

sub countFreq {
    my ($file) = @_;
    my $t1 = Text::Document->new();
    open(my $info, '<', $file) or die "Could not open $file: $!";
    while (my $line = <$info>) {
        chomp $line;
        $line =~ s/[[:punct:]]//g;    # strip punctuation
        foreach my $str (split /\s+/, $line) {
            # %sp is not shown in this snippet; it is declared elsewhere
            if (!defined $sp{lc($str)}) {
                $t1->AddContent($str);
            }
        }
    }
    return $t1;
}

###Update

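Here is a sample program that works fine. It is based on the test code in the distribution, which I looked at for inspiration

I was expecting the test to be much less sensitive, and so I was getting zero from two wildly different text sources. This example adds three short sentences $d1, $d2 and $d3 to a collection $c, and then compares each of the three documents to $d1

Comparing $d1 with itself produces 1, an exact match as expected, while comparing $d2 and $d3 gives 0.087 and 0 respectively: a partial match and no match at all

I hope this helps you to resolve your specific issue?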

use strict;
use warnings 'all';
use Text::Document;
use Text::DocumentCollection;
my $d1 = Text::Document->new;
$d1->AddContent( 'my heart belongs to sally webster' );
my $d2 = Text::Document->new;
$d2->AddContent( 'my heart belongs to the girl next door' );
my $d3 = Text::Document->new;
$d3->AddContent( 'I want nothing to do with my neighbours' );
my $c = Text::DocumentCollection->new( file => 'coll2.db' );
$c->Add('one',   $d1);
$c->Add('two',   $d2);
$c->Add('three', $d3);
for my $doc ( $d1, $d2, $d3 ) {
    my $wcs = $d1->WeightedCosineSimilarity(
        $doc,
        \&Text::DocumentCollection::IDF,
        $c
    );
    die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wcs;
    print $wcs, "\n";
}

##output

1
0.0874311036726221
0
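
The three numbers above can be reproduced by hand, which helps show where they come from. The following is a minimal pure-Perl sketch, not the module's own code: the helper names term_counts, idf and weighted_cosine are made up for illustration, tokens are split on whitespace only, and the weight is assumed to be idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. That assumption is consistent with the 0.0874 result but has not been checked against the module's source.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Text::Document: a hash of lowercased term => count.
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for split ' ', $text;
    return \%count;
}

# Assumed weight function: log( N / number of documents containing the term ).
sub idf {
    my ($docs, $term) = @_;
    my $df = grep { exists $_->{$term} } @$docs;
    return $df ? log( scalar(@$docs) / $df ) : 0;
}

# Cosine of the angle between the two idf-weighted term vectors.
# Returns undef when either vector has zero length.
sub weighted_cosine {
    my ($d, $e, $docs) = @_;
    my %union = ( %$d, %$e );
    my ( $dot, $nd, $ne ) = ( 0, 0, 0 );
    for my $term ( keys %union ) {
        my $w  = idf( $docs, $term );
        my $dw = $w * ( $d->{$term} // 0 );
        my $ew = $w * ( $e->{$term} // 0 );
        $dot += $dw * $ew;
        $nd  += $dw**2;
        $ne  += $ew**2;
    }
    return undef if $nd == 0 || $ne == 0;
    return $dot / sqrt($nd) / sqrt($ne);
}

my @docs = map { term_counts($_) } (
    'my heart belongs to sally webster',
    'my heart belongs to the girl next door',
    'I want nothing to do with my neighbours',
);

# Compare every document with the first one.
my @scores = map { scalar weighted_cosine( $docs[0], $_, \@docs ) } @docs;
printf "%.4f\n", $_ for @scores;
```

Under these assumptions, "my" and "to" occur in all three documents and get a weight of log(3/3) = 0, so they contribute nothing. Only "heart" and "belongs" link the first two sentences, which is why the second score is so small, and the third sentence shares nothing but zero-weight words with the first.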


This is the code for Text::Document::WeightedCosineSimilarity

# this is rather rough
sub WeightedCosineSimilarity
{
    my $self = shift;
    my ($e,$weightFunction,$rock) = @_;
    my ($Dv,$Ev) = ($self->{terms}, $e->{terms});
# compute union
    my %union =  %{$self->{terms}};
    my @keyse = keys %{$e->{terms}};
    @union{@keyse} = @keyse;
    my @allkeys = keys %union;
# weighted D
    my @Dw = map(( defined( $Dv->{$_} )?
        &{$weightFunction}( $rock, $_ )*$Dv->{$_} : 0.0 ),
        @allkeys
    );
# weighted E
    my @Ew = map(( defined( $Ev->{$_} )?
        &{$weightFunction}( $rock, $_ )*$Ev->{$_} : 0.0 ),
        @allkeys
    );
# dot product of D and E
    my $dotProduct = 0.0;
    map( $dotProduct += $Dw[$_] * $Ew[$_] , 0..$#Dw );
# norm of D
    my $nD = 0.0;
    map( $nD += $Dw[$_] * $Dw[$_] , 0..$#Dw );
    $nD = sqrt( $nD );
# norm of E
    my $nE = 0.0;
    map( $nE += $Ew[$_] * $Ew[$_] , 0..$#Ew );
    $nE = sqrt( $nE );
# dot product scaled by norm
    if( ($nD==0) || ($nE==0) ){
        return undef;
    } else {
        return $dotProduct / $nD / $nE;
    }
}

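I'm afraid I don't understand the theory behind it, but it looks like your problem is that either $nD ("the norm of D") or $nE ("the norm of E") is zero

I can only suggest that your two text samples may be too similar/different, or that they are too long/short?

Either way, your code should look like this so as to catch an invalid return value from the cosine function: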
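
One way a norm can end up as zero is sketched below. This is only an illustration under an assumption: that the weight is idf = log(N / df), which matches the 0.087 figure in the sample output but is not taken from the module's source. The helpers idf and weighted_norm and the sample word counts are hypothetical. With that weight, any word that occurs in every document of the collection is weighted log(1) = 0, so a document made up entirely of such words has a norm of zero. A document with no terms at all has a zero norm as well.

```perl
use strict;
use warnings;

# Assumed weight (not taken from the module): log( N / document frequency ).
sub idf {
    my ($n_docs, $doc_freq) = @_;
    return log( $n_docs / $doc_freq );
}

# Euclidean norm of a document's idf-weighted term vector.
# $terms: term => count in this document
# $df:    term => number of documents in the collection containing the term
sub weighted_norm {
    my ($terms, $df, $n_docs) = @_;
    my $sum = 0;
    $sum += ( idf( $n_docs, $df->{$_} ) * $terms->{$_} )**2 for keys %$terms;
    return sqrt $sum;
}

my %doc_a = ( red => 2, green => 1 );

# Case 1: a two-document collection where the other document uses exactly
# the same vocabulary, so every word has df == N and a weight of zero.
my %df_shared   = ( red => 2, green => 2 );
my $norm_shared = weighted_norm( \%doc_a, \%df_shared, 2 );

# Case 2: the other document contains "red" and "blue", so "green" is
# unique to doc_a and keeps a non-zero weight.
my %df_distinct   = ( red => 2, green => 1, blue => 1 );
my $norm_distinct = weighted_norm( \%doc_a, \%df_distinct, 2 );

print "shared vocabulary:   $norm_shared\n";
print "distinct vocabulary: $norm_distinct\n";
```

In the first case the norm is exactly zero, which is the condition under which WeightedCosineSimilarity returns undef. If this is what is happening, two input files with identical vocabularies, or a countFreq call that adds no words to the document, would both produce the empty output described in the question.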

my $wSim = $t1->WeightedCosineSimilarity( $t2,
    \&Text::DocumentCollection::IDF,
    $collection
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wSim;
print "\nWeighted Cosine Sim is: $wSim\n";
