How to compute tf-idf from multiple text files in PHP



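I have successfully computed tf-idf from an array. Now I want tf-idf to be computed from multiple text files, since I have several text files in my directory. Could anyone modify this code so that it first reads all the files in the directory and then computes tf-idf from their contents? My code is below. Thanks.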

// Sample documents, keyed by document ID.
$collection = array(
    1 => 'this string is a short string but a good string',
    2 => 'this one isn\'t quite like the rest but is here',
    3 => 'this is a different short string that\' not as short'
);
$dictionary = array(); // term => document frequency and per-document postings
$docCount = array();   // docID => number of terms in that document
foreach($collection as $docID => $doc) {
    $terms = explode(' ', $doc);
    $docCount[$docID] = count($terms);
    foreach($terms as $term) {
        // First time this term is seen anywhere in the collection.
        if(!isset($dictionary[$term])) {
            $dictionary[$term] = array('df' => 0, 'postings' => array());
        }
        // First time this term is seen in this document.
        if(!isset($dictionary[$term]['postings'][$docID])) {
            $dictionary[$term]['df']++;
            $dictionary[$term]['postings'][$docID] = array('tf' => 0);
        }
        $dictionary[$term]['postings'][$docID]['tf']++;
    }
}
$temp = array('docCount' => $docCount, 'dictionary' => $dictionary);

Computing tf-idf:

$index = $temp;
$term = 'string'; // term to score (example value; any indexed term works)
$docCount = count($index['docCount']);
$entry = $index['dictionary'][$term];
foreach($entry['postings'] as $docID => $postings) {
    // tf-idf = tf * log2(N / df)
    echo "Document $docID and term $term give TFIDF: " .
        ($postings['tf'] * log($docCount / $entry['df'], 2));
    echo "\n";
}

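Take a look at this answer: Read all file contents from a directory - php

There you can find out how to read the contents of every file in a directory.
With that information, you should be able to modify your code yourself so that it works as expected.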
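
As a rough starting point, the two pieces could be combined along these lines. This is only a minimal sketch under some assumptions: the files are plain `.txt` files sitting directly in one directory, the function names `loadCollection`, `buildIndex` and `tfidf` and the `docs` directory are made up for illustration, and terms are split on whitespace (rather than a single space, as in the question) so that newlines inside the files do not end up glued to words.

```php
<?php
// Hypothetical helper: read every .txt file in $dir into a docID => text map.
function loadCollection(string $dir): array
{
    $collection = array();
    // glob() returns the matching paths, or false on error.
    $paths = glob(rtrim($dir, '/') . '/*.txt');
    foreach ($paths ?: array() as $path) {
        $text = file_get_contents($path);
        if ($text !== false) {
            // Use the file name as the document ID.
            $collection[basename($path)] = $text;
        }
    }
    return $collection;
}

// Same indexing logic as in the question, wrapped in a function.
function buildIndex(array $collection): array
{
    $dictionary = array();
    $docCount = array();
    foreach ($collection as $docID => $doc) {
        // Assumption: split on runs of whitespace instead of a single space.
        $terms = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
        $docCount[$docID] = count($terms);
        foreach ($terms as $term) {
            if (!isset($dictionary[$term])) {
                $dictionary[$term] = array('df' => 0, 'postings' => array());
            }
            if (!isset($dictionary[$term]['postings'][$docID])) {
                $dictionary[$term]['df']++;
                $dictionary[$term]['postings'][$docID] = array('tf' => 0);
            }
            $dictionary[$term]['postings'][$docID]['tf']++;
        }
    }
    return array('docCount' => $docCount, 'dictionary' => $dictionary);
}

// Returns docID => tf-idf score for one term (empty array if the term is unknown).
function tfidf(array $index, string $term): array
{
    $scores = array();
    if (!isset($index['dictionary'][$term])) {
        return $scores;
    }
    $docCount = count($index['docCount']);
    $entry = $index['dictionary'][$term];
    foreach ($entry['postings'] as $docID => $posting) {
        $scores[$docID] = $posting['tf'] * log($docCount / $entry['df'], 2);
    }
    return $scores;
}

// Usage: 'docs' is a placeholder directory name.
$index = buildIndex(loadCollection(__DIR__ . '/docs'));
foreach (tfidf($index, 'string') as $docID => $score) {
    echo "Document $docID and term string give TFIDF: $score\n";
}
```

Keying the collection by file name keeps the output readable; if you would rather keep numeric IDs as in your array, you could append to `$collection` with `[]` instead.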
