我有两个数据集,都有狗的品种名称。我想以某种方式将数据集组合起来,但很多狗都有多个名字,所以例如非洲无毛狗也被称为非洲阿比西尼亚梗。
我认为这些名字太不同了,无法通过单词的相似性来匹配,所以我想知道匹配它们的最佳方式是什么。
例如,假设数据集1是
Breed Height (inches)
------------------------------------------------
Golden Retriever 20
Labrador Retriever 22
African Hairless Dog 17
数据集2是
Breed Weight (pounds)
------------------------------------------------
African Abyssinian Terrier 30
Golden Retriever 60
Labrador Retriever 65
我想要一个如下的数据集:
Breed Height (inches) Weight (pounds)
-------------------------------------------------------------------------
African Hairless Dog 17 30
Golden Retriever 20 60
Labrador Retriever 22 65
如果在两个数据帧中有一个唯一的id
列变量,那么left_join
就可以了。例如
df1 = data.frame(id = c(1,2,3), breed = c("African Abyssinian Terrier", "Golden Retriever","Labrador Retriever"), height= c(17, 20 , 22))
df2 = data.frame(id = c(1,2,3), breed = c("African Hairless Dog", "Golden Retriever","Labrador Retriever"), weight = c(30, 60 , 65))
library(dplyr)
> df1 %>% left_join(df2, by = c("id" = "id"))
id breed.x height breed.y weight
1 1 African Abyssinian Terrier 17 African Hairless Dog 30
2 2 Golden Retriever 20 Golden Retriever 60
3 3 Labrador Retriever 22 Labrador Retriever 65
如果你想要,你可以做一些清理
df3 = df1 %>%
left_join(df2, by = c("id" = "id")) %>%
select(-breed.y) %>%
rename(breed = breed.x)
> df3
id breed height weight
1 1 African Abyssinian Terrier 17 30
2 2 Golden Retriever 20 60
3 3 Labrador Retriever 22 65
根据James Curran的回答想出了一个解决方案。创建一个数据帧df3
,其中每行包含一个品种的所有可能昵称,以及一个唯一的id号。使用filter_all
从每个数据集中筛选唯一id号,然后使用唯一id号加入每个数据集。
df1 <- data.frame(breed = c("Labrador Retriever", "GR","African Abyssinian"), height= c(17, 20 , 22))
df2 <- data.frame(breed = c("African Hairless Dog", "Golden Retriever","Labrador Retriever"), weight = c(30, 60 , 65))
#create data frame, df3
df3 <- data.frame(id = 1:3, breed1 = c("African Hairless Dog", "Golden Retriever","Labrador Retriever"), breed2 = c("African Abyssinian Terrier", "GR","LR"))
id1 <- data.frame()
id2 <- data.frame()
for (i in 1:3){
id <- df3 %>% filter_all(any_vars(grepl(df1$breed[i],.))) %>% select(id)
id <- cbind(id,df1$breed[i]) %>% rename(breed = `df1$breed[i]`)
id1 <- rbind(id1,id)
id <- df3 %>% filter_all(any_vars(grepl(df2$breed[i],.))) %>% select(id)
id <- cbind(id,df2$breed[i]) %>% rename(breed = `df2$breed[i]`)
id2 <- rbind(id2,id)
}
df1 <- left_join(df1,id1) #keep breed name from this data set
df2 <- left_join(df2,id2) %>% select(-breed)
final_data_frame <- left_join(df1,df2, by=c('id'))
如果有人有更优雅的解决方案,也将不胜感激!