I have the following data:
df_a <- data.frame(x = c("season", "season", "season", "package", "package"), x1 = c("1","2", "3", "1","6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"), y2 = c("1","2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"), y3 = c("1","2", "3", "2","6"))
df_a
        x x1
1  season  1
2  season  2
3  season  3
4 package  1
5 package  6
df_b
         y y2
1  seaason  1
2   lalala  2
3  package  3
4 paackage  2
5  pakkage  4
6   blabla  6
df_c
        z y3
1  season  1
2 sessson  2
3  saeson  3
4 package  2
5 pakkage  6
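The data frames above have different column names (x, y and z), but they reflect the same information. Besides the different column names, the types (season and package) are not always spelled the same way either.
In real life, this is the result of people who do not work together using different naming conventions for the same thing. This causes a lot of problems, because not only do I have to match these column names by hand, I even have to try to fuzzy-join the types.
I was wondering whether it is possible to build some kind of dictionary that tells me that x, y and z are actually the same thing (let's say x == y|z), and that the types (season == sexson | seaason | etc.) are the same as well.
I think the best approach might be to write a function that scans the column names of each df against the dictionary, copies the matching columns and converts them to a name of my choice, and then does the same for the column contents.
I was thinking of some function that the dictionary can be passed into.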
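# Use list(), not c(): c() would flatten the data frames into a list of columns.
dfs <- list(df_a = df_a, df_b = df_b, df_c = df_c)
vector_of_column_names <- c("x", "y", "z")
column_conversion <- function(dfs, vector_of_column_names) {
  for (i in seq_along(dfs)) {
    # Columns of this data frame whose name appears in the dictionary vector
    index <- names(dfs[[i]]) %in% vector_of_column_names
    names(dfs[[i]])[index] <- vector_of_column_names[1] # The first vector item is the name used.
  }
  dfs # Return the modified list; changes made inside the function are not visible otherwise.
}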
However, I am having some trouble figuring out how to start on this dictionary. Any suggestions? Desired output:
df_a <- data.frame(x = c("season", "season", "season", "package", "package"), x1 = c("1","2", "3", "1","6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"), x = c("season", NA, "package", "package", "package", NA), y2 = c("1","2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"), x = c("season", "season", "season", "package", "package"), y3 = c("1","2", "3", "2","6"))
df_a
        x x1
1  season  1
2  season  2
3  season  3
4 package  1
5 package  6
df_b
         y       x y2
1  seaason  season  1
2   lalala    <NA>  2
3  package package  3
4 paackage package  2
5  pakkage package  4
6   blabla    <NA>  6
df_c
        z       x y3
1  season  season  1
2 sessson  season  2
3  saeson  season  3
4 package package  2
5 pakkage package  6
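If you are not interested in a fuzzy join, the two options I can see are a named list/vector (essentially a dictionary structure) and the hash library. For simplicity, you should probably aim for a named list.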
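library(magrittr)
library(dplyr)
library(stringr)
# Names are the misspellings (used as patterns), values are the corrected spellings.
my_list <- c("sexson" = "season", "paackage" = "package", "pakkage" = "package")
df_b %>%
  mutate(x = str_replace_all(y, my_list)) # df_b stores its types in column y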
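To cover the column names as well as the values, the same named-vector idea can be extended to a list of data frames. The following is only a minimal base R sketch: `col_dict`, `val_dict` and `standardise()` are hypothetical names that do not come from the question, and the lookup is exact, so any spelling missing from `val_dict` becomes NA.

```r
df_a <- data.frame(x = c("season", "season", "season", "package", "package"),
                   x1 = c("1", "2", "3", "1", "6"))
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"),
                   y2 = c("1", "2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"),
                   y3 = c("1", "2", "3", "2", "6"))

# Dictionaries as named character vectors (illustrative contents):
# names are the variants found in the data, values are the canonical form.
col_dict <- c(x = "x", y = "x", z = "x")
val_dict <- c(season = "season", seaason = "season", sessson = "season",
              saeson = "season", package = "package", paackage = "package",
              pakkage = "package")

# Hypothetical helper: for every column listed in col_dict, write a
# canonical column holding the dictionary-corrected values.
standardise <- function(df, col_dict, val_dict) {
  for (col in intersect(names(df), names(col_dict))) {
    target <- col_dict[[col]]
    # Exact lookup by name; values not in val_dict come back as NA.
    df[[target]] <- unname(val_dict[as.character(df[[col]])])
  }
  df
}

dfs <- lapply(list(df_a = df_a, df_b = df_b, df_c = df_c),
              standardise, col_dict = col_dict, val_dict = val_dict)
dfs$df_b
```

Note that the new column is appended at the end of each data frame rather than placed directly after the original column, so the column order differs slightly from the desired output.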
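You could also look at the hunspell package to avoid doing this by hand.
Assuming you already have both lists set up in the df, we can extract the incorrect text as one vector and the correct text as another, then set the names accordingly and replace like this: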
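# Assumes df_b already holds the misspelled column y and the corrected column x.
incorrect <- df_b$y
correct <- df_b$x
names(correct) <- incorrect # names = patterns to find, values = replacements
df_b %>%
  mutate(y = str_replace_all(y, correct))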
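If typing out every misspelling is not practical, one alternative is to match each value against the short list of canonical terms by edit distance. This swaps in a different technique from the named list above: a minimal sketch using base R's `adist()`, where `canonical`, `fuzzy_standardise()` and the `max_dist = 2` cutoff are assumptions chosen for this example data, not something taken from the question.

```r
df_b <- data.frame(y = c("seaason", "lalala", "package", "paackage", "pakkage", "blabla"),
                   y2 = c("1", "2", "3", "2", "4", "6"))
df_c <- data.frame(z = c("season", "sessson", "saeson", "package", "pakkage"),
                   y3 = c("1", "2", "3", "2", "6"))

# The canonical spellings we want to end up with.
canonical <- c("season", "package")

# Hypothetical helper: map each value to its closest canonical term, or NA
# when even the closest term is more than max_dist edits away.
# Assumes `values` contains no missing values.
fuzzy_standardise <- function(values, canonical, max_dist = 2) {
  d <- utils::adist(values, canonical)       # rows = values, columns = canonical terms
  best <- max.col(-d, ties.method = "first") # column index of the smallest distance per row
  best_dist <- d[cbind(seq_along(values), best)]
  ifelse(best_dist <= max_dist, canonical[best], NA_character_)
}

df_b$x <- fuzzy_standardise(df_b$y, canonical)
df_c$x <- fuzzy_standardise(df_c$z, canonical)
df_b
```

The cutoff is a trade-off: a larger `max_dist` catches more typos but also risks mapping unrelated strings such as "lalala" to a canonical term, so it is worth checking the result on real data.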