r - Convert one-hot encoded variables into a single column



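I have age columns like the ones below that are dummy-encoded (one-hot). How can I collapse these columns into a single column using dplyr?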

Input:

  age_0-10 age_11-20 age_21-30 age_31-40 age_41-50 age_51-60 gender
1        0         1         0         0         0         0      0
2        0         0         1         0         0         0      1
3        0         0         0         1         0         0      0
4        0         1         0         0         0         0      1
5        0         0         0         0         0         1      1

Expected output:

    age gender
1 11-20      0
2 21-30      1
3 31-40      0
4 11-20      1
5 51-60      1

Thanks to @Adam's comment, this now uses names_prefix. Here is one possible solution:

library(tidyverse)
df <- data.frame(
  check.names = FALSE,
  `age_0-10` = c(0L, 0L, 0L, 0L, 0L),
  `age_11-20` = c(1L, 0L, 0L, 1L, 0L),
  `age_21-30` = c(0L, 1L, 0L, 0L, 0L),
  `age_31-40` = c(0L, 0L, 1L, 0L, 0L),
  `age_41-50` = c(0L, 0L, 0L, 0L, 0L),
  `age_51-60` = c(0L, 0L, 0L, 0L, 1L),
  gender = c(0L, 1L, 0L, 1L, 1L)
)
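df %>%
  # Stack the age columns into name/value pairs, dropping the "age_" prefix
  pivot_longer(cols = starts_with("age"), names_to = "age", names_prefix = "age_") %>%
  # Keep only the band that is switched on for each row
  filter(value == 1) %>%
  select(age, gender)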
#> # A tibble: 5 × 2
#>   age   gender
#>   <chr>  <int>
#> 1 11-20      0
#> 2 21-30      1
#> 3 31-40      0
#> 4 11-20      1
#> 5 51-60      1

Here is a dplyr approach using c_across().

library(dplyr)
library(stringr)
df %>%
  rowwise() %>%
  mutate(age = str_remove(names(.)[which(c_across(starts_with("age")) == 1)], "^age_")) %>%
  ungroup() %>%
  select(age, gender)
# # A tibble: 5 x 2
#   age   gender
#   <chr>  <int>
# 1 11-20      0
# 2 21-30      1
# 3 31-40      0
# 4 11-20      1
# 5 51-60      1

Try the base R code below, which uses max.col:

cbind(
  age = gsub("^age_", "", head(names(df), -1)[max.col(df[-ncol(df)])]),
  df[ncol(df)]
)

This gives:

    age gender
1 11-20      0
2 21-30      1
3 31-40      0
4 11-20      1
5 51-60      1
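
One caveat with the max.col approach: by default it breaks ties at random, so a row with no 1 in any age column (all zeros) would be assigned an arbitrary age band. The following is a minimal sketch of a guard, assuming such rows should become NA. The helper name `one_hot_to_label` and the example frame `df_gap` are hypothetical and not part of the original answers.

```r
# Hypothetical helper: collapse one-hot columns sharing a prefix into one label vector.
# Assumes `prefix` contains no regex metacharacters.
one_hot_to_label <- function(data, prefix = "age_") {
  is_band <- startsWith(names(data), prefix)
  m <- as.matrix(data[is_band])
  # ties.method = "first" makes the choice deterministic instead of random
  idx <- max.col(m, ties.method = "first")
  labels <- sub(paste0("^", prefix), "", colnames(m))[idx]
  # Rows that do not contain exactly one 1 are not valid one-hot rows
  labels[rowSums(m == 1) != 1] <- NA_character_
  labels
}

# Small example where the third row has no age band set
df_gap <- data.frame(
  check.names = FALSE,
  `age_0-10`  = c(0L, 0L, 0L),
  `age_11-20` = c(1L, 0L, 0L),
  `age_21-30` = c(0L, 1L, 0L),
  gender      = c(0L, 1L, 0L)
)

result_gap <- data.frame(age = one_hot_to_label(df_gap), gender = df_gap$gender)
print(result_gap)
```

With valid one-hot input (exactly one 1 per row) the guard never triggers and the labels agree with the max.col answer above.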

Here is another tidyverse solution (note that it keeps the age_ prefix in the result):

library(dplyr)
library(purrr)
# Note: cur_data() is deprecated as of dplyr 1.1.0 in favor of pick()
df %>%
  mutate(age = pmap_chr(select(cur_data(), !gender),
                        ~ names(df)[-ncol(df)][as.logical(c(...))])) %>%
  select(age, gender)
#         age gender
# 1 age_11-20      0
# 2 age_21-30      1
# 3 age_31-40      0
# 4 age_11-20      1
# 5 age_51-60      1
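
To match the expected output in the question, the prefix can be stripped after the lookup. The sketch below is one possible variant, not the original answerer's code: it assumes dplyr, purrr and stringr are installed, introduces a hypothetical intermediate `age_cols`, and selects the age columns up front so it does not rely on cur_data().

```r
library(dplyr)
library(purrr)
library(stringr)

df <- data.frame(
  check.names = FALSE,
  `age_0-10`  = c(0L, 0L, 0L, 0L, 0L),
  `age_11-20` = c(1L, 0L, 0L, 1L, 0L),
  `age_21-30` = c(0L, 1L, 0L, 0L, 0L),
  `age_31-40` = c(0L, 0L, 1L, 0L, 0L),
  `age_41-50` = c(0L, 0L, 0L, 0L, 0L),
  `age_51-60` = c(0L, 0L, 0L, 0L, 1L),
  gender      = c(0L, 1L, 0L, 1L, 1L)
)

# Hypothetical intermediate: only the one-hot age columns
age_cols <- df %>% select(starts_with("age_"))

res <- df %>%
  mutate(
    # For each row, pick the name of the column whose value is 1
    # (assumes exactly one 1 per row, as in the question's data)
    age = pmap_chr(age_cols, ~ names(age_cols)[as.logical(c(...))]),
    # Drop the "age_" prefix so the result matches the expected output
    age = str_remove(age, "^age_")
  ) %>%
  select(age, gender)

print(res)
```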
