I'm trying to use levenshteinSim to compare the text in one column, based on the values of other columns.
My sample data:
d1 d2 d3
A 100 This is not a test for the project
A 100 This is not a test for the project
A 100 This is not a test for the project
A 300 This is test for the project
A 300 This is test for the project
A 300 This is test for the project
A 400 This is test for the project XYX
A 500 This is not a test for the project
B 10 This is a new project
B 20 This is about vegetables
B 30 This is about animals
B 10 This is a new project
B 20 This is about vegetables
B 30 This is about animals
B 10 This is a new project
B 20 This is about vegetables
B 30 This is about animals
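I want to compare the text in d3 and get a match percentage based on d1 and d2.
I can find the unique records (see the sample code below), but after that I am unable to apply levenshteinSim to get the % match of the d3 text based on d1 and d2.
The sample output would look something like: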
d1 d2 d3 match_percentage
A 100 This is not a test for the project 100%
A 300 This is test for the project 56%
and so on, with each d3 value compared against all the other values that share the same d1.
Sample code:
First, I found the unique records in df:
library(dplyr)

abc <- read.csv("Duplicate_test.csv", header = TRUE)
# Flag repeated (d1, d2) combinations, then keep the first row of each.
def <- abc %>%
  group_by(d1, d2) %>%
  mutate(num_dups = n(),
         dup_id = row_number()) %>%
  ungroup() %>%
  mutate(is_duplicated = dup_id > 1)
unique_records <- filter(def, is_duplicated == FALSE)
You can use distinct to get the unique rows and, for each d1, calculate the max match percentage between the d3 values.
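library(dplyr)
library(purrr)  # map_dbl() comes from purrr, not dplyr

abc %>%
  distinct() %>%
  group_by(d1) %>%
  # Compare each row's d3 with every other d3 in the same d1 group.
  mutate(match_percentage = map_dbl(row_number(),
    ~max(RecordLinkage::levenshteinSim(d3[.x], d3[-.x]))) * 100)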
# d1 d2 d3 match_percentage
# <chr> <int> <chr> <dbl>
#1 A 100 This is not a test for the project 100
#2 A 300 This is test for the project 87.5
#3 A 400 This is test for the project XYX 87.5
#4 A 500 This is not a test for the project 100
#5 B 10 This is a new project 45.5
#6 B 20 This is about vegetables 70.8
#7 B 30 This is about animals 70.8
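To see where a figure such as the 87.5 in row 2 comes from, it can help to compute one pair by hand. As far as I know, levenshteinSim is documented as 1 - d / max(nchar(a), nchar(b)), where d is the Levenshtein edit distance. The sketch below re-implements that formula with base R's utils::adist; lev_sim is a hypothetical helper written for illustration, not part of RecordLinkage, and it assumes a is a single string.

```r
# Hypothetical helper: Levenshtein similarity built on base R's adist().
# Assumes `a` is a single string; `b` may be a character vector.
lev_sim <- function(a, b) {
  d <- drop(utils::adist(a, b))        # edit distance from a to each b
  1 - d / pmax(nchar(a), nchar(b))     # scale by the longer string
}

a <- "This is test for the project"      # 28 characters
b <- "This is test for the project XYX"  # 32 characters

# b is a plus the 4 characters " XYX", so d = 4 and 1 - 4/32 = 0.875.
lev_sim(a, b) * 100
```

The same string scores lower against "This is not a test for the project" (6 insertions over 34 characters), which is why the maximum for that row is the 87.5 match with the XYX variant.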
Data
abc <- structure(list(d1 = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B"), d2 = c(100L, 100L,
100L, 300L, 300L, 300L, 400L, 500L, 10L, 20L, 30L, 10L, 20L,
30L, 10L, 20L, 30L), d3 = c("This is not a test for the project",
"This is not a test for the project", "This is not a test for the project",
"This is test for the project", "This is test for the project",
"This is test for the project", "This is test for the project XYX",
"This is not a test for the project", "This is a new project",
"This is about vegetables", "This is about animals", "This is a new project",
"This is about vegetables", "This is about animals", "This is a new project",
"This is about vegetables", "This is about animals")),
class = "data.frame", row.names = c(NA, -17L))
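If installing RecordLinkage is not an option, the per-group maximum can also be sketched in base R. This is an alternative under stated assumptions, not the method used in the answer: it builds the full pairwise similarity matrix for one group with utils::adist, and the grp data frame is a hand-copied subset (the unique rows of group A) used only for illustration.

```r
# Assumed toy data: the unique (d2, d3) rows of group A from the question.
grp <- data.frame(
  d2 = c(100L, 300L, 400L, 500L),
  d3 = c("This is not a test for the project",
         "This is test for the project",
         "This is test for the project XYX",
         "This is not a test for the project")
)

# Pairwise edit distances, each scaled by the longer string of the pair.
d   <- utils::adist(grp$d3)
len <- nchar(grp$d3)
sim <- 1 - d / outer(len, len, pmax)

diag(sim) <- NA  # a row should not be compared with itself
grp$match_percentage <- apply(sim, 1, max, na.rm = TRUE) * 100
grp
```

For this group the result agrees with the output shown earlier (100, 87.5, 87.5, 100). A group with a single row has nothing to compare against, so that case would need separate handling.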