My data includes a Name column. Some names are written in up to eight different ways. I tried to group them with the following code:
groups <- list()
i <- 1
while(length(x) > 0)
{
id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
groups[[i]] <- x[id]
x <- x[-id]
i <- i + 1
}
head(groups)
groups
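Next, I want to add a new column that returns, for each row, a canonical spelling of the name, for example the most frequently used one. The result should look like this: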
   A               B
1. John Snow       John Snow
2. Peter Wright    Peter Wright
3. john snow       John Snow
4. John snow       John Snow
5. Peter wright    Peter Wright
6. J. Snow         John Snow
7. John Snow       John Snow
etc.
How can I get there?
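This answer is heavily based on a previous question/answer about grouping strings. It only adds two steps: finding the mode (the most frequent spelling) of each group, and assigning that mode back to each of the original strings.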
## The data
Names = c("John Snow", "Peter Wright", "john snow",
"John snow", "Peter wright", "J. Snow", "John Snow")
## Grouping like in the previous question
groups <- list()
i <- 1
x = Names
while(length(x) > 0)
{
id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.25)
groups[[i]] <- x[id]
x <- x[-id]
i <- i + 1
}
## Find the mode for each group
Modes = sapply(groups, function(x) names(which.max(table(x))))
## Assign the correct mode to each string
StandardName = rep("", length(Names))
for(i in seq_along(groups)) {
StandardName[Names %in% groups[[i]]] = Modes[i]
}
StandardName
[1] "John Snow" "Peter wright" "John Snow" "John Snow" "Peter wright"
[6] "John Snow" "John Snow"
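Note that "Peter Wright" and "Peter wright" each occur once, so the mode is a tie. which.max() returns the first maximum, and table() orders its entries by sorted name, so which spelling wins depends on your locale's sort order. If you would rather keep the spelling that appears first in the data, here is a minimal sketch. mode_first is a helper name I made up for illustration; it is not part of base R.

```r
## Hypothetical helper: most frequent value, ties broken by first appearance
mode_first <- function(v) {
  ## Fixing the factor levels to order of appearance keeps table() from sorting
  counts <- table(factor(v, levels = unique(v)))
  ## which.max() returns the first maximum, i.e. the earliest spelling on a tie
  names(counts)[which.max(counts)]
}

mode_first(c("Peter Wright", "Peter wright"))
mode_first(c("John Snow", "john snow", "John snow", "J. Snow", "John Snow"))
```

To use it, replace the anonymous function in the sapply() call with mode_first.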
You may need to experiment to find the right value for the max.distance argument of agrep.
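One way to run that experiment is to print what a single pattern matches at several thresholds. This is only a sketch: the candidate values 0.1, 0.25 and 0.5 are arbitrary choices of mine, and the right setting depends on your data.

```r
Names <- c("John Snow", "Peter Wright", "john snow",
           "John snow", "Peter wright", "J. Snow", "John Snow")

## Candidate thresholds (arbitrary values, for illustration only)
distances <- c(0.1, 0.25, 0.5)

for (d in distances) {
  ## value = TRUE returns the matching strings instead of their indices
  hits <- agrep("John Snow", Names, ignore.case = TRUE,
                max.distance = d, value = TRUE)
  cat("max.distance =", d, ":", paste(hits, collapse = ", "), "\n")
}
```

A value that is too small splits variants of one name into separate groups; a value that is too large merges different names together.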
If you want to add the result to your data frame, just add
df$StandardName = StandardName
To write the result to a file that can be opened in Excel, use
write.csv(df, "MyData.csv")
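By default write.csv also writes the row names as an extra first column, which shows up in Excel as an unnamed column of numbers. If you do not want that, pass row.names = FALSE. A small self-contained sketch follows; the data frame contents and the temporary file location are made up for illustration.

```r
## Toy data frame standing in for your real data (hypothetical values)
df <- data.frame(Name = c("John Snow", "john snow", "Peter Wright"),
                 stringsAsFactors = FALSE)
df$StandardName <- c("John Snow", "John Snow", "Peter Wright")

## Write to a temporary location without the row-name column
out_file <- file.path(tempdir(), "MyData.csv")
write.csv(df, out_file, row.names = FALSE)

## Read the file back to confirm what was written
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```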