How to efficiently recode multiple columns in R?



I need to recode some data. First, suppose the original data looks like this

A data.frame: 6 × 5
col1    col2    col3    col4    col5
<chr>   <chr>   <chr>   <chr>   <chr>
s1  414234  244575  539645  436236
s2      NA  512342  644252  835325
s3      NA      NA  816747  475295
s4      NA      NA      NA  125429
s5      NA      NA      NA      NA
s6  617465  844526      NA  194262

Second, it should be converted to

A data.frame: 6 × 5
col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
s1       4       2       5       4
s2      NA       5       6       8
s3      NA       NA      8       4
s4      NA       NA     NA       1
s5      NA       NA     NA      NA
s6       6        8     NA       1

because I want to recode based on the first digit. Third, after recoding (see the recoding pattern in the MWE below), it should look like this

A data.frame: 6 × 5
col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
s1       3       1       3       3
s2      NA       3       4       5
s3      NA       NA      5       3
s4      NA       NA     NA       1
s5      NA       NA     NA      NA
s6       4        5     NA       1

Fourth, if all columns except the first are NA, the entire row should be dropped, i.e.

A data.frame: 6 × 5
col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
s1       3       1       3       3
s2      NA       3       4       5
s3      NA       NA      5       3
s4      NA       NA     NA       1
s6       4        5     NA       1

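is the final data.

The first and second steps are easy to implement, but I am struggling with the third and fourth steps because I am new to R (see the MWE below). For the third step, I tried to use mutate on multiple columns, but I got Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')". The fourth step is easy to do in Python with thresh, but I am not sure whether R has an equivalent.

How can this be done? Also, I work with large amounts of data, so efficient solutions would be very welcome.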

library(dplyr)
df <- data.frame(
  col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
  col2 = c("414234", NA, NA, NA, NA, "617465"),
  col3 = c("244575", "512342", NA, NA, NA, "844526"),
  col4 = c("539645", "644252", "816747", NA, NA, NA),
  col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
  df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
  df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}

Base R way:

# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
#   V1 V2 V3 V4 V5
# 1 s1  3  1  3  3
# 2 s2 NA  3  4  5
# 3 s3 NA NA  5  3
# 4 s4 NA NA NA  1
# 6 s6  4  5 NA  1

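As you can see, there is no need to perform the operations column by column: they can be applied to all the columns at once, which is more efficient.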
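The recoding line above relies on indexing a lookup vector: element k + 1 of the vector holds the new code for first digit k, so indexing with the digits plus one recodes every value in a single vectorized step. A minimal sketch on a small made-up vector (the names `lookup` and `digits` are illustrative and not part of the answer):

```r
# Lookup vector: element k + 1 is the new code for first digit k (0 to 8),
# following the recoding pattern from the question.
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)

# Hypothetical first digits, including a missing value.
digits <- c(4, 2, NA, 8, 0)

# R indexes from 1, hence the + 1; an NA index yields NA.
recoded <- lookup[digits + 1]
print(recoded)
```

Note that the pattern in the question does not cover a first digit of 9; with this lookup vector such a value would fall past the end of the vector and come back as NA.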

You can do this with a combination of tidyverse packages. In R, we usually avoid for loops unless we really need them. Vectorization is almost always preferable.

library(dplyr)
library(stringr) # for str_sub
library(purrr)   # for negate
mat = matrix(c(     "s1",     "s2",     "s3",     "s4",     "s5",     "s6",
"414234",       NA,       NA,       NA,       NA, "617465", 
"244575", "512342",       NA,       NA,       NA, "844526",
"539645", "644252", "816747",       NA,       NA,       NA,
"436236", "835325", "475295", "125429",       NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#>   V1   V2   V3   V4   V5
#> 1 s1    4    2    5    4
#> 2 s2 <NA>    5    6    8
#> 3 s3 <NA> <NA>    8    4
#> 4 s4 <NA> <NA> <NA>    1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6    6    8 <NA>    1
## Step 3: Recode
df <- mutate(df, 
across(V2:V5, 
recode,
`0` = "1", `1` = "1", `2` = "1", `3` = "2", 
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s5 NA NA NA NA
#> 6 s6  4  5 NA  1
## Step 4: keep only rows where at least one value is not missing
## By purrr::negate()-ing is.na, we select only the rows where
## at least one of V2:V5 is not NA
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s6  4  5 NA  1

Created on 2022-12-13 with reprex v2.0.2
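On the question of a thresh-like row filter, one possible base R counterpart is to count the non-missing values per row with rowSums and keep the rows that reach a threshold. The sketch below assumes the data has already been recoded; the helper `keep_rows` and its `thresh` argument are hypothetical names chosen for illustration, not an existing R function.

```r
# Small example frame mirroring the recoded data from the question.
df <- data.frame(
  col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
  col2 = c(3, NA, NA, NA, NA, 4),
  col3 = c(1, 3, NA, NA, NA, 5),
  col4 = c(3, 4, 5, NA, NA, NA),
  col5 = c(3, 5, 3, 1, NA, 1)
)

# Hypothetical helper: keep rows with at least `thresh` non-missing values
# in the selected columns (here: every column except the first).
keep_rows <- function(data, cols, thresh = 1) {
  n_present <- rowSums(!is.na(data[cols]))
  data[n_present >= thresh, , drop = FALSE]
}

result <- keep_rows(df, cols = -1, thresh = 1)
print(result)
```

With thresh = 1 this drops only the rows in which every selected column is NA (s5 here). Because rowSums works on the whole logical matrix at once, it avoids calling a function once per row, which may matter on large data.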

This one uses some fancy math

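library(dplyr)
library(tidyr)  # pivot_longer(), pivot_wider()
library(purrr)  # map_dbl()

df |>
  pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
  # first digit = val %/% 10^(number of digits - 1); + 1 because R indexes from 1
  mutate(val = map_dbl(as.integer(val),
                       ~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) + 1])) |>
  # dropping NA cells also removes rows that are NA in every column (s5)
  filter(!is.na(val)) |>
  pivot_wider(values_from = val, names_from = col)

##> # A tibble: 5 × 5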
##>   col1   col2  col3  col4  col5
##>   <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1        3     1     3     3
##> 2 s2       NA     3     4     5
##> 3 s3       NA    NA     5     3
##> 4 s4       NA    NA    NA     1
##> 5 s6        4     5    NA     1
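The arithmetic in that pipeline takes the first digit of a positive number by integer-dividing it by the power of ten that matches its number of digits. Since both the arithmetic and the lookup indexing are vectorized in R, the same idea can be sketched on a whole vector without map_dbl. The values and the names `val`, `first_digit`, and `lookup` below are illustrative assumptions, and the sketch assumes strictly positive inputs (zero or negative values would need separate handling).

```r
# Recoding pattern from the question: element k + 1 recodes first digit k.
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)

# Hypothetical values of different lengths, including a missing one.
val <- c(414234, 244575, NA, 85, 7)

# trunc(log10(val)) is the number of digits minus one, so integer division
# by 10 raised to that power leaves only the leading digit.
first_digit <- val %/% 10^trunc(log10(val))

# Vectorized lookup; NA stays NA.
recoded <- lookup[first_digit + 1]
print(first_digit)
print(recoded)
```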
