查找重复项的最佳性能算法是什么



让我们看看我的代码:

function checkForDuplicates() {            
$data = $this->input->post();
$project_id = $data['project_id'];
$this->db->where('project_id', $project_id);
$paper = $this->db->get('paper')->result();
$paper2 = $paper; //duplica o array de papers
$duplicatesCount = 0;
foreach($paper as $p){
$similarity = null;
foreach($paper2 as $p2){
if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
if($p -> paper_id !== $p2 -> paper_id){ 
similar_text($p -> title, $p2 -> title, $similarity);
if ($similarity > 90) { 
$p -> status_selection_id = 4;
$this->db->where('paper_id', $p -> paper_id);
$this->db->update('paper', $p);
$duplicatesCount ++;
}
}
}
}
}
$data = array(
'duplicatesCount' => $duplicatesCount,
'message' => 'Duplicates where found!'
);
echo json_encode($data);
}
  1. similar_text需要180秒来检查1500条记录
  2. Levenstein检查1500条记录需要101秒
  3. 如果($pp1===$pp2(需要45秒来检查1500条记录

检查重复记录并更改其状态的最快方法是什么

优化通常是减少IO。

在您的情况下,减少SQL查询的数量应该可以缩短处理时间。

如果你需要处理大量的记录,你应该把它分成块。每个区块应该包含一批可以放入内存(RAM(的记录。

从数据库中检索您的区块。处理您的区块(即使用循环(,并使用数组(即(跟踪您需要在DB中进行的更改。最后,使用尽可能少的查询批量更新数据库。

$data = $this->input->post();
$project_id = $data['project_id'];
$this->db->where('project_id', $project_id);
$paper = $this->db->get('paper')->result();
$paper2 = $paper; //duplica o array de papers
$duplicatesCount = 0;
// keep track of updates
$updates = [];
foreach($paper as $p){
$similarity = null;
foreach($paper2 as $p2){
if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
if($p -> paper_id !== $p2 -> paper_id){ 
similar_text($p -> title, $p2 -> title, $similarity);
if ($similarity > 90) { 
$updates[] = [
'paper_id' => $p -> paper_id,
'status_selection_id' => 4,
];
$duplicatesCount ++;
}
}
}
}
}
if ($duplicatesCount > 0) {
// here you have to create a big SQL request with all the updates
// maybe your DB adaptor can do it for you ?
$query = $this->db->somethingToCreateABulkQuery();
foreach ($updates as $update) {
// stuff 
$query->somethingToAddAndUpdate($update);
}
$this->db->somethingToExecuteTheQuery($query);
}

相关内容

  • 没有找到相关文章

最新更新