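I am trying to check email data for duplicates, that is, whether the same name appears with a different email address. It works, but when the same name occurs more than once, every occurrence should be flagged as a duplicate, not just the later ones.
So, for example, if abcd@ddd.com has multiple entries such as abcd@ccc.com or abcd@fff.com, all three should be flagged as duplicates.
Similarly, if abby.del@ddd.com has multiple entries such as abby-del@ccc.com or abby_del@fff.com, all three should be flagged as duplicates.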
df <- data.frame(EMP.ID = c(88111,"BBB4477","BBB4058","BBB5832","BBB0338","BBB1814","BBB6543",875430,875970,"BBB0243","BBB1943","BBB9344","BBB9701","BBB1814","BBB8648","BBB4373","BBB7270","BBB6165","BBB7460","BBB7528","BBB6092"),
name = c("link adam","dy tt","link adam","gbesada","dojeda","slew lang","?alpucheta","r zona","jachaval","allo nyyn","mbautis","grand fring","jali","kintom dang","namoti","shan mig","NA","NA","NA","NA",NA),
email = c("link.adam@gmail.com","dy.tt@abcd.com","link_adam@gmail.com","gbesada@abcd.com","dojeda@abcd.com","?slew.lang@abcd.com","dy-tt@abcd.com","?rzona@abcd.com","jachaval@abcd.com","allo@abcd.com","mbautis@abcd.com","grand.fring@abcd.com","jali@abcd.com","kintom.dang@abcd.com","namoti@abcd.com","shan.mig@abcd.com","mbautis@XYZ.com","?slew.lang@abcd.com",NA,"NA",NA))
library(dplyr)
library(stringr)

separator <- " "
valuesToIgnore <- c(NA, NA)
df <- df %>%
  mutate(across(c(name, email), tolower)) %>%
  # take the part before "@" (lower-case letters and dots only)
  mutate(email_name1 = str_extract(email, "([a-z.]+)(?=@.+)")) %>%
  # replace dots with the separator; the dot must be escaped as "\\." in an R string
  mutate(email_name1 = str_replace_all(email_name1, "\\.", separator)) %>%
  mutate(`13. duplicate name with mailid` = ifelse(duplicated(email_name1, incomparables = valuesToIgnore), "Duplicate email username exists", NA))
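I have tried a lot of solutions. Is there a permanent solution for handling email data?
I would solve your problem as follows (note the change in the regular expression as well):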
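df %>%
  mutate(across(c(name, email), tolower)) %>%
  # keep everything up to and including "@"
  mutate(email_name1 = str_extract(email, "([^@]+)@")) %>%
  # drop every non-word character and underscore, so ".", "-", "_", "?" and "@" all disappear
  mutate(email_name1 = str_replace_all(email_name1, "[\\W_]", "")) %>%
  group_by(email_name1) %>%
  # n() is evaluated per group, so each row gets the size of its group
  mutate(count = n())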
# A tibble: 21 x 5
# Groups:   email_name1 [15]
   EMP.ID  name        email                email_name1 count
   <chr>   <chr>       <chr>                <chr>       <int>
 1 88111   link adam   link.adam@gmail.com  linkadam        2
 2 BBB4477 dy tt       dy.tt@abcd.com       dytt            2
 3 BBB4058 link adam   link_adam@gmail.com  linkadam        2
 4 BBB5832 gbesada     gbesada@abcd.com     gbesada         1
 5 BBB0338 dojeda      dojeda@abcd.com      dojeda          1
 6 BBB1814 slew lang   ?slew.lang@abcd.com  slewlang        2
 7 BBB6543 ?alpucheta  dy-tt@abcd.com       dytt            2
 8 875430  r zona      ?rzona@abcd.com      rzona           1
 9 875970  jachaval    jachaval@abcd.com    jachaval        1
10 BBB0243 allo nyyn   allo@abcd.com        allo            1
11 BBB1943 mbautis     mbautis@abcd.com     mbautis         2
12 BBB9344 grand fring grand.fring@abcd.com grandfring      1
13 BBB9701 jali        jali@abcd.com        jali            1
14 BBB1814 kintom dang kintom.dang@abcd.com kintomdang      1
15 BBB8648 namoti      namoti@abcd.com      namoti          1
16 BBB4373 shan mig    shan.mig@abcd.com    shanmig         1
17 BBB7270 na          mbautis@xyz.com      mbautis         2
18 BBB6165 na          ?slew.lang@abcd.com  slewlang        2
19 BBB7460 na          NA                   NA              3
20 BBB7528 na          na                   NA              3
21 BBB6092 NA          NA                   NA              3
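Note the combination of group_by and mutate. Here mutate operates only within each group, so n() returns the size of each email_name1 group, and every row whose normalized username occurs more than once gets a count greater than 1.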
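If you want a text flag rather than a count, the group size can be turned into one directly. The following is only a minimal sketch under some assumptions: the `emails` data frame is a small made-up sample, and the column names `email_key` and `dup_flag` are hypothetical, not taken from the question. Rows without a usable username (NA) are deliberately left unflagged, which may or may not be what you need.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data in the spirit of the question's df
emails <- data.frame(
  email = c("link.adam@gmail.com", "link_adam@gmail.com", "dy.tt@abcd.com",
            "dy-tt@abcd.com", "gbesada@abcd.com", NA),
  stringsAsFactors = FALSE
)

flagged <- emails %>%
  mutate(
    # part before "@", lower-cased, with punctuation and underscores removed
    email_key = str_extract(tolower(email), "^[^@]+(?=@)"),
    email_key = str_replace_all(email_key, "[\\W_]", "")
  ) %>%
  group_by(email_key) %>%
  mutate(
    # n() is the group size, so every member of a repeated key is flagged;
    # rows with an NA key are never flagged
    dup_flag = if_else(!is.na(email_key) & n() > 1,
                       "Duplicate email username exists", NA_character_)
  ) %>%
  ungroup()

print(flagged)
```

Because the flag is computed from the group size rather than from duplicated(), the first occurrence of a username is flagged along with the later ones.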