r - Recoding similar factor levels across multiple data frames using purrr and dplyr



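Below are two simple data frames. I would like to recode (collapse) the Sat1 and Sat2 columns so that every degree of satisfaction is coded simply as Satisfied and every degree of dissatisfaction as Dissatisfied. Neutral stays Neutral. The factors will therefore have three levels: Satisfied, Dissatisfied, and Neutral.

I would normally do this by binding the data frames and using lapply together with recode from the car package, for example: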

DF1[2:3] <- lapply(DF1[2:3], recode, c('"Somewhat Satisfied"= "Satisfied","Satisfied"="Satisfied","Extremely Dissatisfied"="Dissatisfied"........etc, etc

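I would like to do this with the map functions, specifically at_map from purrr (to keep the data frames intact, but I am new to purrr, so feel free to suggest other versions of map), together with dplyr, tidyr, stringr, and ggplot2, so that everything can be piped easily.

The example linked below is the kind of thing I want to accomplish, but for recoding, and I could not get it to work.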

http://www.r-bloggers.com/using-purrr-with-dplyr/

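I want to use at_map or a similar map function so that I can keep the original Sat1 and Sat2 columns; the recoded columns would be added to the data frame under new names. It would be great if that step could also be included in the function.

In reality I will have many data frames, so I want to define the recoding of the factor levels only once and then use a function from purrr to apply it to all the data frames with as little code as possible.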

Names<-c("James","Chris","Jessica","Tomoki","Anna","Gerald")
Sat1<-c("Satisfied","Very Satisfied","Dissatisfied","Somewhat Satisfied","Dissatisfied","Neutral")
Sat2<-c("Very Dissatisfied","Somewhat Satisfied","Neutral","Neutral","Satisfied","Satisfied")
Program<-c("A","B","A","C","B","D")
Pets<-c("Snake","Dog","Dog","Dog","Cat","None")
DF1<-data.frame(Names,Sat1,Sat2,Program,Pets)
Names<-c("Tim","John","Amy","Alberto","Desrahi","Francesca")
Sat1<-c("Extremely Satisfied","Satisfied","Satisfed","Somewhat Dissatisfied","Dissatisfied","Satisfied")
Sat2<-c("Dissatisfied","Somewhat Dissatisfied","Neutral","Extremely Dissatisfied","Somewhat Satisfied","Somewhat Dissatisfied")
Program<-c("A","B","A","C","B","D")

DF2<-data.frame(Names,Sat1,Sat2,Program)

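One way to get this done is to use mutate_each together with one of the map functions to go through a list of data.frames. Using mutate_each, or its equivalent, from dplyr_0.4.3.9001 makes it possible to rename the new columns.

In this case you can use string manipulation instead of recoding. I believe you want to extract Satisfied, Dissatisfied, or Neutral from your existing strings. You can do that with sub and a regular expression. For example: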

sub(".*(Satisfied|Dissatisfied|Neutral).*$", "\\1", DF2$Sat2)
[1] "Dissatisfied" "Dissatisfied" "Neutral"      "Dissatisfied" "Satisfied"    "Dissatisfied"

The stringr package has a nice function for extracting specific strings, str_extract.

library(stringr)
str_extract(DF2$Sat2, "Satisfied|Neutral|Dissatisfied")
[1] "Dissatisfied" "Dissatisfied" "Neutral"      "Dissatisfied" "Satisfied"    "Dissatisfied"

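One thing worth checking with the pattern approach: any value the pattern does not match comes back as NA. In the question's data, DF2$Sat1 contains the misspelling "Satisfed", so that entry would be lost. A minimal sketch of how such values could be found before recoding (the variable names sat1, pattern, recoded, and unmatched are only illustrative):

```r
library(stringr)

# Sat1 column of DF2 from the question, including the misspelled "Satisfed"
sat1 <- c("Extremely Satisfied", "Satisfied", "Satisfed",
          "Somewhat Dissatisfied", "Dissatisfied", "Satisfied")

pattern <- "Satisfied|Neutral|Dissatisfied"
recoded <- str_extract(sat1, pattern)

# Values the pattern could not match are returned as NA;
# list the distinct original strings that were affected
unmatched <- unique(sat1[is.na(recoded)])
unmatched
```

Anything listed in unmatched would need to be corrected in the data, or covered by the pattern, before the recoded columns can be trusted.
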
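You can use this inside mutate_each to apply one of these functions to several columns. The name you give the function in funs is appended to the new column names. I used recode. For one of the datasets: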

DF1 %>% 
  mutate_each(funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")), 
              starts_with("Sat"))
    Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
3 Jessica       Dissatisfied            Neutral       A   Dog Dissatisfied      Neutral
4  Tomoki Somewhat Satisfied            Neutral       C   Dog    Satisfied      Neutral
5    Anna       Dissatisfied          Satisfied       B   Cat Dissatisfied    Satisfied
6  Gerald            Neutral          Satisfied       D  None      Neutral    Satisfied
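
In more recent dplyr releases, mutate_each and funs are deprecated. A rough equivalent, assuming dplyr 1.0.0 or later, would use across with the .names argument to build the suffixed column names. The data frame below repeats only the relevant columns of DF1 from the question, and DF1_recoded is just an illustrative name:

```r
library(dplyr)
library(stringr)

# Relevant columns of DF1 from the question
DF1 <- data.frame(
  Names = c("James", "Chris", "Jessica", "Tomoki", "Anna", "Gerald"),
  Sat1 = c("Satisfied", "Very Satisfied", "Dissatisfied",
           "Somewhat Satisfied", "Dissatisfied", "Neutral"),
  Sat2 = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral",
           "Neutral", "Satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)

# Apply str_extract to every column starting with "Sat" and
# store the result in new columns named <column>_recode
DF1_recoded <- DF1 %>%
  mutate(across(starts_with("Sat"),
                ~ str_extract(.x, "Satisfied|Neutral|Dissatisfied"),
                .names = "{.col}_recode"))

DF1_recoded
```

The original Sat1 and Sat2 columns are kept, and Sat1_recode and Sat2_recode are added next to them.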

To work through many datasets stored in a list, you can use the map function from purrr to apply a function to each element of the list.

list(DF1, DF2) %>%
  map(~mutate_each(.x, 
                   funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")), 
                   starts_with("Sat")))
[[1]]
    Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
...
[[2]]
      Names                  Sat1                   Sat2 Program  Sat1_recode  Sat2_recode
1       Tim   Extremely Satisfied           Dissatisfied       A    Satisfied Dissatisfied
2      John             Satisfied  Somewhat Dissatisfied       B    Satisfied Dissatisfied
...
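
Since the question asks for the recoding to be defined once and reused, the step can also be wrapped in a small function and passed to map. The sketch below is only one way to do it: collapse_sat is a hypothetical helper name, it assumes dplyr 1.0.0 or later for across, and the data frames repeat only the relevant columns from the question. It also converts the result to a factor with the three requested levels.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical helper: adds a <column>_recode factor for every column
# whose name starts with `prefix`, keeping the original columns
collapse_sat <- function(df, prefix = "Sat") {
  sat_levels <- c("Satisfied", "Neutral", "Dissatisfied")
  pattern <- "Satisfied|Neutral|Dissatisfied"
  df %>%
    mutate(across(starts_with(prefix),
                  ~ factor(str_extract(as.character(.x), pattern),
                           levels = sat_levels),
                  .names = "{.col}_recode"))
}

# Relevant columns of the two data frames from the question
DF1 <- data.frame(
  Names = c("James", "Chris", "Jessica", "Tomoki", "Anna", "Gerald"),
  Sat1 = c("Satisfied", "Very Satisfied", "Dissatisfied",
           "Somewhat Satisfied", "Dissatisfied", "Neutral"),
  Sat2 = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral",
           "Neutral", "Satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Names = c("Tim", "John", "Amy", "Alberto", "Desrahi", "Francesca"),
  Sat1 = c("Extremely Satisfied", "Satisfied", "Satisfed",
           "Somewhat Dissatisfied", "Dissatisfied", "Satisfied"),
  Sat2 = c("Dissatisfied", "Somewhat Dissatisfied", "Neutral",
           "Extremely Dissatisfied", "Somewhat Satisfied", "Somewhat Dissatisfied"),
  stringsAsFactors = FALSE
)

# Naming the list elements keeps track of which result belongs to which input
recoded <- map(list(DF1 = DF1, DF2 = DF2), collapse_sat)
recoded$DF2
```

With a named list, each recoded data frame can be pulled out by name, for example recoded$DF1.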

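Using map_df instead binds all the elements of the list into a single data.frame, which may or may not be what you want. Use the .id argument to add an identifier for each original dataset.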

list(DF1, DF2) %>%
  map_df(~mutate_each(.x, 
                      funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")), 
                      starts_with("Sat")), .id = "Group")
   Group     Names                  Sat1                   Sat2 Program  Pets  Sat1_recode
1      1     James             Satisfied      Very Dissatisfied       A Snake    Satisfied
2      1     Chris        Very Satisfied     Somewhat Satisfied       B   Dog    Satisfied
3      1   Jessica          Dissatisfied                Neutral       A   Dog Dissatisfied
4      1    Tomoki    Somewhat Satisfied                Neutral       C   Dog    Satisfied
5      1      Anna          Dissatisfied              Satisfied       B   Cat Dissatisfied
6      1    Gerald               Neutral              Satisfied       D  None      Neutral
7      2       Tim   Extremely Satisfied           Dissatisfied       A  <NA>    Satisfied
8      2      John             Satisfied  Somewhat Dissatisfied       B  <NA>    Satisfied
...
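
If the list is named, the identifier column carries the names rather than the positions 1, 2, and so on. The sketch below shows this with map followed by dplyr::bind_rows(.id = ...), which, as far as I know, labels rows the same way for a named list. The data frames hold only the first two rows of the question's data and a single Sat column to keep the example short, and combined is an illustrative name.

```r
library(dplyr)
library(purrr)
library(stringr)

# First two rows of each data frame from the question (shortened for illustration)
DF1 <- data.frame(Names = c("James", "Chris"),
                  Sat1 = c("Satisfied", "Very Satisfied"),
                  Pets = c("Snake", "Dog"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Names = c("Tim", "John"),
                  Sat1 = c("Extremely Satisfied", "Satisfied"),
                  stringsAsFactors = FALSE)

combined <- list(DF1 = DF1, DF2 = DF2) %>%
  # recode each data frame separately
  map(~ mutate(.x, Sat1_recode = str_extract(Sat1, "Satisfied|Neutral|Dissatisfied"))) %>%
  # stack the results; Group holds the list names "DF1" and "DF2"
  bind_rows(.id = "Group")

combined
```

Columns that exist in only one of the inputs, such as Pets here, are filled with NA for the rows coming from the other data frame.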

I use joins for large recodes like this, and in this case I think converting to a long data frame makes the problem easier to think about.

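library(tidyr)
library(dplyr)
# Reshape to long format: one row per person and satisfaction column
mdf <- DF1 %>% 
  gather(var, value, starts_with("Sat"))
# Lookup table from original level to recoded value
# (only some levels are listed; unlisted levels become NA after the join)
recode_df <- tibble(value = c("Extremely Satisfied", "Satisfied", "Somewhat Dissatisfied", "Dissatisfied"),
                    recode = 1:4)
mdf <- left_join(mdf, recode_df, by = "value")
# Drop the original values so that spread() returns one row per person
mdf %>% select(-value) %>% spread(var, recode)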
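
To tie the join idea back to the three levels asked for in the question, the lookup table can list every level that occurs in the data and map it to Satisfied, Neutral, or Dissatisfied. This is only a sketch: it assumes tidyr 1.0.0 or later and uses pivot_longer and pivot_wider in place of gather and spread, and the names lookup and DF1_recoded are illustrative.

```r
library(dplyr)
library(tidyr)

# Relevant columns of DF1 from the question
DF1 <- data.frame(
  Names = c("James", "Chris", "Jessica", "Tomoki", "Anna", "Gerald"),
  Sat1 = c("Satisfied", "Very Satisfied", "Dissatisfied",
           "Somewhat Satisfied", "Dissatisfied", "Neutral"),
  Sat2 = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral",
           "Neutral", "Satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)

# Lookup table defined once: each original level and its collapsed level
lookup <- tibble(
  value = c("Extremely Satisfied", "Very Satisfied", "Satisfied", "Somewhat Satisfied",
            "Neutral",
            "Somewhat Dissatisfied", "Dissatisfied", "Very Dissatisfied",
            "Extremely Dissatisfied"),
  recode = c(rep("Satisfied", 4), "Neutral", rep("Dissatisfied", 4))
)

DF1_recoded <- DF1 %>%
  # long format: one row per person and satisfaction column
  pivot_longer(starts_with("Sat"), names_to = "var", values_to = "value") %>%
  left_join(lookup, by = "value") %>%
  # back to wide format, keeping both the original and the recoded values
  pivot_wider(names_from = var, values_from = c(value, recode),
              names_glue = "{var}_{.value}")

DF1_recoded
```

The same lookup table can be joined to any number of data frames, and a level missing from the table shows up as NA in the recoded column, which makes misspellings easy to spot.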
