Creating a dictionary to automatically rename columns and column entries in R



I have the following data:

df_a <- data.frame(x = c("season", "season", "season", "package", "package"), x1 = c("1","2", "3", "1","6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"), y2 = c("1","2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"), y3 = c("1","2", "3", "2","6"))
df_a
        x x1
1  season  1
2  season  2
3  season  3
4 package  1
5 package  6
df_b
         y y2
1  seaason  1
2   lalala  2
3  package  3
4 paackage  2
5  pakkage  4
6   blabla  6
df_c
        z y3
1  season  1
2 sessson  2
3  saeson  3
4 package  2
5 pakkage  6

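The data frames above have different column names (x, y, and z), but they hold the same information. Besides the differing column names, the types (season and package) are not always spelled the same way either.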

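In real life, this is the result of individuals who do not work together using different naming conventions for the same thing. It causes a lot of problems, because not only do I have to match these column names manually, I even have to attempt fuzzy joins on the types.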

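I was wondering whether it is possible to build some kind of dictionary that tells me that x, y, and z are actually the same thing (say x == y | z), and similarly for the types (season == sexson | seaason | etc.).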

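I think the best approach might be to write a function that checks the column names of each df against the dictionary, copies the matching columns, and converts them to a name of my choosing, then does the same for the column contents.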

I was thinking of some kind of function that takes the dictionary as input:

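dfs <- list(df_a = df_a, df_b = df_b, df_c = df_c) # list(), not c(), keeps each data frame intact
vector_of_column_names <- c("x", "y", "z")
column_conversion <- function(dfs, vector_of_column_names) {
  for (i in seq_along(dfs)) {
    index <- names(dfs[[i]]) %in% vector_of_column_names
    names(dfs[[i]])[index] <- vector_of_column_names[1] # The first vector item is the name used.
  }
  dfs # return the renamed copies; the original data frames are not modified in place
}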

However, I am having a bit of trouble working out how to get started on this dictionary. Any suggestions? Desired output:

df_a <- data.frame(x = c("season", "season", "season", "package", "package"), x1 = c("1","2", "3", "1","6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"), x = c("season", NA, "package", "package", "package", NA), y2 = c("1","2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"), x = c("season", "season", "season", "package", "package"), y3 = c("1","2", "3", "2","6"))
df_a
        x x1
1  season  1
2  season  2
3  season  3
4 package  1
5 package  6
df_b
         y       x y2
1  seaason  season  1
2   lalala    <NA>  2
3  package package  3
4 paackage package  2
5  pakkage package  4
6   blabla    <NA>  6
df_c
        z       x y3
1  season  season  1
2 sessson  season  2
3  saeson  season  3
4 package package  2
5 pakkage package  6

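If you are not interested in fuzzy joins, the two options I can see are a named list/vector (essentially a dictionary structure) and a hash library. For simplicity, you should probably aim for a named list.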

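library(magrittr)
library(dplyr)
library(stringr)
# names are the misspellings to look for, values are the corrections
# ("seaason" is added here because it is the spelling that occurs in df_b)
my_list <- c("sexson" = "season", "seaason" = "season", "paackage" = "package", "pakkage" = "package")

# df_b has no x column yet, so build it from y; a named vector is passed as pattern = replacement
df_b %>%
  mutate(x = str_replace_all(y, my_list)) # entries not in my_list are left unchanged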
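
If entries missing from the dictionary should become NA, as in the desired output, an exact lookup in a named vector may fit better than pattern replacement. The following is a minimal base R sketch under that assumption. `column_dict`, `value_dict`, and `add_canonical` are hypothetical names, not part of any package, and the dictionary contents cover only the spellings in the example data.

```r
# Example data from the question
df_a <- data.frame(x = c("season", "season", "season", "package", "package"),
                   x1 = c("1", "2", "3", "1", "6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"),
                   y2 = c("1", "2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"),
                   y3 = c("1", "2", "3", "2", "6"))

# Hypothetical dictionaries (assumptions): names are the variants, values the canonical form
column_dict <- c(x = "x", y = "x", z = "x")
value_dict <- c(season = "season", seaason = "season", sessson = "season", saeson = "season",
                package = "package", paackage = "package", pakkage = "package")

# Hypothetical helper: for every column listed in column_dict, add (or overwrite)
# the canonical column with exactly matched values; unknown entries become NA
add_canonical <- function(df, column_dict, value_dict) {
  for (col in intersect(names(df), names(column_dict))) {
    target <- column_dict[[col]]
    df[[target]] <- unname(value_dict[as.character(df[[col]])])
    if (col != target) {
      # place the new column directly after the column it was derived from
      others <- setdiff(names(df), target)
      df <- df[append(others, target, after = match(col, others))]
    }
  }
  df
}

dfs_clean <- lapply(list(df_a = df_a, df_b = df_b, df_c = df_c),
                    add_canonical, column_dict = column_dict, value_dict = value_dict)
dfs_clean$df_b
```

Because the lookup is exact, a spelling that is not in `value_dict` is never silently mapped to something else; it shows up as NA and can be added to the dictionary afterwards.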

You could also look at the hunspell package to avoid doing this by hand.
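
Another way to avoid typing every misspelling by hand, shown here instead of hunspell, is to derive the dictionary from edit distances with base R's `adist`. This is only a sketch: `build_value_dict` is a hypothetical helper, the list of canonical terms is assumed to be known in advance, and the cutoff of 2 edits is an assumption that happens to suit the example data.

```r
# Hypothetical helper: map each observed spelling to the nearest canonical term,
# or to NA when even the nearest term is more than max_dist edits away
build_value_dict <- function(values, canonical, max_dist = 2) {
  values <- unique(as.character(values))
  d <- adist(values, canonical)                    # Levenshtein distance matrix
  best <- max.col(-d, ties.method = "first")       # index of the closest canonical term
  ok <- d[cbind(seq_along(values), best)] <= max_dist
  setNames(ifelse(ok, canonical[best], NA_character_), values)
}

# Spellings observed in df_b$y and df_c$z from the question
observed <- c("seaason", "lalala", "package", "paackage", "pakkage", "blabla",
              "season", "sessson", "saeson")
value_dict <- build_value_dict(observed, canonical = c("season", "package"))
value_dict
```

The resulting named vector has the same shape as a hand-written dictionary, so it can be inspected and corrected before it is used for replacement.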

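Assuming you already have the lists set up in the df, we can extract the incorrect text as one list and the correct text as another, then set the names accordingly and replace like this: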

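# assumes df_b already has both y (as typed) and x (corrected), as in the desired output
known <- !is.na(df_b$x)        # skip rows that have no correction
incorrect <- df_b$y[known]
correct <- df_b$x[known]
names(correct) <- incorrect    # names are the patterns, values are the replacements
df_b %>%
  mutate(y = str_replace_all(y, correct))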
