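Compare string values in a column based on multiple fields in R

I am trying to use levenshteinSim to compare the text in one column, grouped by the values of other columns.

My sample data is: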

d1  d2  d3  
A   100 This is not a test for the project  
A   100 This is not a test for the project  
A   100 This is not a test for the project  
A   300 This is test for the project  
A   300 This is test for the project  
A   300 This is test for the project  
A   400 This is test for the project XYX  
A   500 This is not a test for the project  
B   10  This is  a new project  
B   20  This is about vegetables  
B   30  This is about animals  
B   10  This is  a new project  
B   20  This is about vegetables  
B   30  This is about animals  
B   10  This is  a new project  
B   20  This is about vegetables  
B   30  This is about animals

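I want to compare the text in d3 and get a match percentage based on d1 and d2.

I can find the unique records, but I have not been able to apply levenshteinSim to get the % match of the text in d3 based on d1 and d2.

The sample output would look like: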

d1  d2  d3                                  match_percentage
A   100 This is not a test for the project  100%
A   300 This is test for the project        56%

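and so on, with each d3 value compared against all the other d3 values that share the same d1.

Sample code:

First, I found the unique records in the data frame: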

library(dplyr)

abc <- read.csv("Duplicate_test.csv", header = TRUE)
# Number the rows within each (d1, d2) pair and flag every repeat after the first
def <- abc %>%
  group_by(d1, d2) %>%
  mutate(num_dups = n(),
         dup_id = row_number()) %>%
  ungroup() %>%
  mutate(is_duplicated = dup_id > 1)
unique_records <- filter(def, is_duplicated == FALSE)

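You can use distinct to get the unique rows and then, for each d1, compute the maximum match percentage between each d3 value and the other d3 values in the same group.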

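library(dplyr)
library(purrr)  # provides map_dbl()

abc %>%
  distinct() %>%
  group_by(d1) %>%
  # For each row, compare its d3 with every other d3 in the group and keep the best score
  mutate(match_percentage = map_dbl(row_number(),
    ~ max(RecordLinkage::levenshteinSim(d3[.x], d3[-.x]))) * 100)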
#   d1       d2 d3                                 match_percentage
#  <chr> <int> <chr>                                         <dbl>
#1 A       100 This is not a test for the project            100  
#2 A       300 This is test for the project                   87.5
#3 A       400 This is test for the project XYX               87.5
#4 A       500 This is not a test for the project            100  
#5 B        10 This is  a new project                         45.5
#6 B        20 This is about vegetables                       70.8
#7 B        30 This is about animals                          70.8
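
To see where a figure such as 87.5 comes from: levenshteinSim is documented as 1 - d / max(length), where d is the Levenshtein distance between the two strings. Below is a minimal base R sketch of that formula, assuming utils::adist() as the distance function. The helper name lev_sim is hypothetical and is not part of RecordLinkage.

```r
# Hypothetical helper (not part of RecordLinkage): normalized Levenshtein
# similarity built on base R's utils::adist().
lev_sim <- function(a, b) {
  d <- as.vector(utils::adist(a, b))  # edit distance from a to each element of b
  1 - d / pmax(nchar(a), nchar(b))    # scale by the length of the longer string
}

# Row 2 of the result above, compared with the other two distinct texts in group A
s <- lev_sim("This is test for the project",
             c("This is not a test for the project",
               "This is test for the project XYX"))
round(max(s) * 100, 1)
# 87.5
```

The closest text is the "XYX" variant: 4 edits over 32 characters gives 1 - 4/32 = 0.875. The "not a" variant needs 6 edits over 34 characters, which is about 0.824.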

Data

abc <- structure(list(d1 = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B", "B"), d2 = c(100L, 100L, 
100L, 300L, 300L, 300L, 400L, 500L, 10L, 20L, 30L, 10L, 20L, 
30L, 10L, 20L, 30L), d3 = c("This is not a test for the project", 
"This is not a test for the project", "This is not a test for the project", 
"This is test for the project", "This is test for the project", 
"This is test for the project", "This is test for the project XYX", 
"This is not a test for the project", "This is  a new project", 
"This is about vegetables", "This is about animals", "This is  a new project", 
"This is about vegetables", "This is about animals", "This is  a new project", 
"This is about vegetables", "This is about animals")), 
class = "data.frame", row.names = c(NA, -17L))
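
One edge case the sample data does not exercise: if a d1 group has only one distinct row, there is nothing left to compare against, and in R max() of an empty vector returns -Inf with a warning. The sketch below is one possible guard in base R that returns NA for such groups. It assumes utils::adist() for the distance, and the names lev_sim and best_match as well as the small data frame are hypothetical, for illustration only.

```r
# Hypothetical helpers, base R only.
lev_sim <- function(a, b) {
  d <- as.vector(utils::adist(a, b))  # edit distance from a to each element of b
  1 - d / pmax(nchar(a), nchar(b))    # scale by the length of the longer string
}

# Best similarity (in percent) of each string against the others in its group;
# NA when the group holds a single string and there is nothing to compare to.
best_match <- function(x) {
  if (length(x) < 2) return(rep(NA_real_, length(x)))
  vapply(seq_along(x),
         function(i) max(lev_sim(x[i], x[-i])) * 100,
         numeric(1))
}

# Illustrative data: group "C" has only one row
df <- data.frame(d1 = c("A", "A", "C"),
                 d3 = c("This is test for the project",
                        "This is test for the project XYX",
                        "Only row in its group"))

# Apply best_match() per d1 group and put the results back in row order
df$match_percentage <- unsplit(lapply(split(df$d3, df$d1), best_match), df$d1)
df
```

Whether NA, 0, or 100 is the right value for a lone row depends on how the percentages are used afterwards, so treat the NA here as a placeholder choice.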
