Checking for duplicate names in email data

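I am trying to check email data for duplicates, that is, cases where the same name appears with a different email address. My code works, but when a name has duplicates, every one of those entries should be flagged as a duplicate.

So, for example, if abcd@ddd.com has multiple entries such as abcd@ccc.com or abcd@fff.com, all three should be flagged as duplicates.

Likewise, if abby.del@ddd.com has multiple entries such as abby-del@ccc.com or abby_del@fff.com, all three should be flagged as duplicates.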

df <- data.frame(EMP.ID = c(88111,"BBB4477","BBB4058","BBB5832","BBB0338","BBB1814","BBB6543",875430,875970,"BBB0243","BBB1943","BBB9344","BBB9701","BBB1814","BBB8648","BBB4373","BBB7270","BBB6165","BBB7460","BBB7528","BBB6092"),
name = c("link adam","dy tt","link adam","gbesada","dojeda","slew lang","?alpucheta","r zona","jachaval","allo nyyn","mbautis","grand fring","jali","kintom dang","namoti","shan mig","NA","NA","NA","NA",NA),
email = c("link.adam@gmail.com","dy.tt@abcd.com","link_adam@gmail.com","gbesada@abcd.com","dojeda@abcd.com","?slew.lang@abcd.com","dy-tt@abcd.com","?rzona@abcd.com","jachaval@abcd.com","allo@abcd.com","mbautis@abcd.com","grand.fring@abcd.com","jali@abcd.com","kintom.dang@abcd.com","namoti@abcd.com","shan.mig@abcd.com","mbautis@XYZ.com","?slew.lang@abcd.com",NA,"NA",NA))

library(dplyr)
library(stringr)

separator <- " "
valuesToIgnore <- c(NA, NA)
df <- df %>%
  mutate(across(c(name, email), tolower)) %>%
  # keep the run of letters and dots immediately before the "@"
  mutate(email_name1 = str_extract(email, "([a-z.]+)(?=@.+)")) %>%
  # replace the dots with the separator
  mutate(email_name1 = str_replace_all(email_name1, "\\.", separator)) %>%
  mutate(`13. duplicate name with mailid` = ifelse(duplicated(email_name1, incomparables = valuesToIgnore), "Duplicate email username exists", NA))

I have tried many solutions. Is there a lasting way to handle email data like this?

I would solve your problem as follows (note also the change in the regular expression):

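df %>%
  mutate(across(c(name, email), tolower)) %>%
  # take everything up to and including the "@"
  mutate(email_name1 = str_extract(email, "([^@]+)@")) %>%
  # strip every non-alphanumeric character, including "_" and the "@"
  mutate(email_name1 = str_replace_all(email_name1, "[\\W_]", "")) %>%
  group_by(email_name1) %>%
  # number of rows that share the same cleaned username
  mutate(count = n())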
# A tibble: 21 x 5
# Groups:   email_name1 [17]
EMP.ID  name        email                email_name1 count
<chr>   <chr>       <chr>                <chr>       <int>
1 88111   link adam   link.adam@gmail.com  link adam       1
2 BBB4477 dy tt       dy.tt@abcd.com       dy tt           1
3 BBB4058 link adam   link_adam@gmail.com  adam            1
4 BBB5832 gbesada     gbesada@abcd.com     gbesada         1
5 BBB0338 dojeda      dojeda@abcd.com      dojeda          1
6 BBB1814 slew lang   ?slew.lang@abcd.com  slew lang       2
7 BBB6543 ?alpucheta  dy-tt@abcd.com       tt              1
8 875430  r zona      ?rzona@abcd.com      rzona           1
9 875970  jachaval    jachaval@abcd.com    jachaval        1
10 BBB0243 allo nyyn   allo@abcd.com        allo            1
11 BBB1943 mbautis     mbautis@abcd.com     mbautis         2
12 BBB9344 grand fring grand.fring@abcd.com grand fring     1
13 BBB9701 jali        jali@abcd.com        jali            1
14 BBB1814 kintom dang kintom.dang@abcd.com kintom dang     1
15 BBB8648 namoti      namoti@abcd.com      namoti          1
16 BBB4373 shan mig    shan.mig@abcd.com    shan mig        1
17 BBB7270 na          mbautis@xyz.com      mbautis         2
18 BBB6165 na          ?slew.lang@abcd.com  slew lang       2
19 BBB7460 na          NA                   NA              3
20 BBB7528 na          na                   NA              3
21 BBB6092 NA          NA                   NA              3

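Note the combination of group_by and mutate: here mutate operates only within each group, so n() counts the rows of the current group rather than of the whole data frame.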
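
To get from a per-group count to the flag the question asks for, one option is to mark every row whose group has more than one member. The following is only a sketch under stated assumptions: the sample data, the column names email_key and duplicate_flag, and the choice to leave missing emails unflagged are all made up for illustration and are not part of the original answer.

```r
library(dplyr)
library(stringr)

# Hypothetical sample modelled on the addresses in the question
emails <- data.frame(
  email = c("abby.del@ddd.com", "abby-del@ccc.com", "abby_del@fff.com",
            "jali@abcd.com", NA),
  stringsAsFactors = FALSE
)

flagged <- emails %>%
  mutate(
    # text before the "@", lower-cased
    email_key = str_extract(tolower(email), "^[^@]+"),
    # drop ".", "-", "_" and any other non-alphanumeric character
    email_key = str_replace_all(email_key, "[\\W_]", "")
  ) %>%
  group_by(email_key) %>%
  mutate(
    # rows sharing the same cleaned username
    count = n(),
    # flag every member of a group with more than one row;
    # rows with a missing email are left unflagged (an assumption)
    duplicate_flag = if_else(!is.na(email_key) & n() > 1,
                             "Duplicate email username exists",
                             NA_character_)
  ) %>%
  ungroup()
```

In this sketch all three abby del rows receive the flag, including the first one. That is the difference from duplicated(), which marks only the second and later occurrences of a value.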
