Minimizing the number of computations in fuzzy matching and a for loop

I am currently trying to use fuzzy matching to find potential duplicates in a large dataset (500,000+ rows). The code has three main parts:

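A function I wrote that scores how likely a record is to be a duplicate of another record in the dataset (it returns the highest score it finds).
A function that identifies the position of the record most likely to be the duplicate.
A for loop that runs the two functions above for every record and stores the results in the dupScore and positionBestMatch columns.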
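
As a rough illustration of these three parts, here is a minimal sketch in which the score and the position come out of a single distance computation per record instead of two separate function calls. Everything in it is an assumption made for illustration: `best_match` is a hypothetical helper, the negated edit distance from base R's `utils::adist` stands in for the real scoring function (which is not shown in the question), and the toy records mirror the question's example.

```r
# Toy records mirroring the example in the question.
df <- data.frame(
  Name = c("Ben", "Abe", "Benjamin", "Gabby", "Abraham", "Gabriella"),
  DOB  = c("6/3/1994", "5/5/2005", "6/3/1994",
           "01/01/1900", "5/5/2005", "01/01/1900"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the best score AND its position in one pass.
# Assumption: negated edit distance is used as the score (higher = more alike).
best_match <- function(i, df) {
  key <- paste(df$Name, df$DOB)
  d <- as.vector(utils::adist(key[i], key))  # distance from row i to every row
  d[i] <- Inf                                # never match a row with itself
  j <- which.min(d)                          # position of the closest record
  list(score = -d[j], position = j)
}

# Run the helper once per row and unpack both results.
res <- lapply(seq_len(nrow(df)), best_match, df = df)
df$dupScore          <- vapply(res, function(r) r$score, numeric(1))
df$positionBestMatch <- vapply(res, function(r) r$position, integer(1))
print(df)
```

This still does one comparison pass per row, so it only removes the duplicated work between the two functions; it does not reduce the number of rows visited.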

An example of the resulting dataset:

Name       DOB         dupScore  positionBestMatch
Ben        6/3/1994    15        3
Abe        5/5/2005    11        5
Benjamin   6/3/1994    15        1
Gabby      01/01/1900  10        6
Abraham    5/5/2005    11        2
Gabriella  01/01/1900  10        4

The for loop that computes these scores looks roughly like this (scorefunc and positionfunc are functions I wrote myself):

for (i in seq_len(nrow(df))) {
  # scorefunc() and positionfunc() are called with the row index
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
}

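Obviously, on a dataset with this many rows, iterating over every row is very time-consuming and computationally intensive. How can I change my for loop so that: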

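When the dupScore is computed for a row, the score is written not only to row [i] but also to the row given by positionBestMatch?
And the loop only runs for rows whose dupScore and positionBestMatch values are still empty.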

I hope this makes sense!

Try using a while loop:

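all_inds <- seq_len(nrow(df))
while (length(all_inds) > 0) {
  i <- all_inds[1]  # next row that has not been scored yet
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
  # copy the score to the best match so that row can be skipped
  df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
  # drop the current row and its best match from the work list
  all_inds <- setdiff(all_inds, c(i, df$positionBestMatch[i]))
}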

However, this will leave some values of df$positionBestMatch empty, because the matched row receives the score but not a position.
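
One possible way to avoid those empty values is to write the position back to the partner row as well and to skip any row that is already filled. The sketch below is only an illustration under stated assumptions: `scorefunc` and `positionfunc` here are hypothetical stand-ins that look up the values from the question's example table (the real functions are not shown), `calls` is a counter added to show how many expensive calls are made, and the write-back assumes the best match is symmetric (if row i's best match is row j, then row j's best match is row i), which real fuzzy matching does not guarantee.

```r
# Example rows from the question; both result columns start out empty (NA).
df <- data.frame(
  Name = c("Ben", "Abe", "Benjamin", "Gabby", "Abraham", "Gabriella"),
  dupScore = NA_real_,
  positionBestMatch = NA_integer_
)

# Hypothetical stand-ins for the asker's functions: they return the values
# from the example table and count how often the "expensive" call happens.
calls <- 0
scorefunc <- function(i) {
  calls <<- calls + 1
  c(15, 11, 15, 10, 11, 10)[i]
}
positionfunc <- function(i) c(3L, 5L, 1L, 6L, 2L, 4L)[i]

for (i in seq_len(nrow(df))) {
  if (!is.na(df$dupScore[i])) next   # already filled in as another row's partner
  j <- positionfunc(i)
  s <- scorefunc(i)
  df$dupScore[i] <- s
  df$positionBestMatch[i] <- j
  # Assumption: the match is symmetric, so row j gets the same score
  # and points back at row i. Only fill it if it is still empty.
  if (is.na(df$dupScore[j])) {
    df$dupScore[j] <- s
    df$positionBestMatch[j] <- i
  }
}

print(df)
print(calls)  # number of times scorefunc() was actually called
```

Initialising both columns with NA before the loop matters here: the skip test relies on untouched rows still being NA. If the symmetry assumption does not hold for the data, the rows filled by write-back may not carry their true best match and would need to be recomputed.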
