I need to recode some data. First, suppose the raw data looks like this:
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
s1 414234 244575 539645 436236
s2 NA 512342 644252 835325
s3 NA NA 816747 475295
s4 NA NA NA 125429
s5 NA NA NA NA
s6 617465 844526 NA 194262
Second, it should be converted to
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 4 2 5 4
s2 NA 5 6 8
s3 NA NA 8 4
s4 NA NA NA 1
s5 NA NA NA NA
s6 6 8 NA 1
because I want to recode based on the first digit. Third, after recoding (see the recoding scheme in the MWE below), it should look like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s5 NA NA NA NA
s6 4 5 NA 1
and fourth, if all columns except the first are empty, the entire row should be dropped, i.e.
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s6 4 5 NA 1
is the final data.
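The first and second steps were easy to implement, but I am struggling with the third and fourth steps because I am new to R (see the MWE below). For the third step, I tried to use mutate on multiple columns, but got this error:
Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')"
The fourth step is easy to implement in Python with thresh, but I am not sure whether there is an equivalent in R.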
How can this be done? Also, I work with large amounts of data, so an efficient solution would be very welcome.
library(dplyr)
df <- data.frame(
col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
col2 = c("414234", NA, NA, NA, NA, "617465"),
col3 = c("244575", "512342", NA, NA, NA, "844526"),
col4 = c("539645", "644252", "816747", NA, NA, NA),
col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
Base R way:
# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
# V1 V2 V3 V4 V5
# 1 s1 3 1 3 3
# 2 s2 NA 3 4 5
# 3 s3 NA NA 5 3
# 4 s4 NA NA NA 1
# 6 s6 4 5 NA 1
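As you can see, there is no need to perform the operations column by column, because they can be applied to the whole matrix at once, which is more efficient.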
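The recoding step above relies on indexing a lookup vector, and the row filter can also be written without apply. The following is only a minimal sketch on toy values: the names digits, lookup, m and thresh are made up for illustration, and the rowSums filter is a loose analogue of the thresh idea mentioned in the question, not an exact port of it.

```r
# Toy first digits, including a missing value
digits <- c(4, NA, 6, 0, 8)

# Lookup vector: position k + 1 holds the new code for digit k (0 to 8)
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)

# Indexing with digits + 1 recodes every element at once; NA stays NA
recoded <- lookup[digits + 1]

# Toy matrix, filled column-wise: rows are (3, 1), (NA, NA), (NA, 5)
m <- matrix(c(3, NA, NA,
              1, NA, 5), nrow = 3)

# Keep rows with at least `thresh` non-missing values (hypothetical name)
thresh <- 1
keep <- rowSums(!is.na(m)) >= thresh
m_kept <- m[keep, , drop = FALSE]
```

rowSums operates on the whole logical matrix in one call, so it avoids calling a function once per row, which may matter on large data.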
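You can do this with a combination of tidyverse packages. In R, we generally avoid for loops unless we really need them. Vectorized code is almost always preferable.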
library(dplyr)
library(stringr) # for str_sub
library(purrr) # for negate
mat = matrix(c( "s1", "s2", "s3", "s4", "s5", "s6",
"414234", NA, NA, NA, NA, "617465",
"244575", "512342", NA, NA, NA, "844526",
"539645", "644252", "816747", NA, NA, NA,
"436236", "835325", "475295", "125429", NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 4 2 5 4
#> 2 s2 <NA> 5 6 8
#> 3 s3 <NA> <NA> 8 4
#> 4 s4 <NA> <NA> <NA> 1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6 6 8 <NA> 1
## Step 3: Recode
df <- mutate(df,
across(V2:V5,
recode,
`0` = "1", `1` = "1", `2` = "1", `3` = "2",
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s5 NA NA NA NA
#> 6 s6 4 5 NA 1
## Step 4: drop rows where every value in V2:V5 is missing
## By purrr::negate()-ing is.na, we keep only the rows where
## at least one value is not missing
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s6 4 5 NA 1
Created on 2022-12-13 with reprex v2.0.2
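For context, the error in the question arises because df[, i] is a plain vector, while mutate() only has methods for data frames; the column selection belongs inside across() instead. The sketch below condenses the steps into one pipeline on a small stand-in data set. It assumes R 4.1 or later (for |> and \(x)) and writes the functions as lambdas, since recent dplyr versions (1.1.0 onward, as far as I know) deprecate passing extra arguments through across(). The names df_demo, lookup and result are made up for illustration, and a lookup vector stands in for recode().

```r
library(dplyr)

# Small stand-in for the question's data (three rows, two value columns)
df_demo <- data.frame(
  col1 = c("s1", "s5", "s6"),
  col2 = c("414234", NA, "617465"),
  col3 = c("244575", NA, "844526")
)

# Position k + 1 holds the new code for first digit k (0 to 8)
lookup <- c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L)

result <- df_demo |>
  # first digit -> integer -> recoded value, for every selected column
  mutate(across(col2:col3, \(x) lookup[as.integer(substr(x, 1, 1)) + 1L])) |>
  # keep rows with at least one non-missing value in the selected columns
  filter(if_any(col2:col3, \(x) !is.na(x)))
```

Here the s5 row is dropped because both of its value columns are NA.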
This one uses some fancy math:
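library(dplyr)
library(tidyr) # for pivot_longer / pivot_wider
library(purrr) # for map_dbl
# df is the original data frame from the question (col1 to col5, all character)
df |>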
pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
mutate(val = map_dbl(as.integer(val),
~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) +1])) |>
filter(!is.na(val)) |>
pivot_wider(values_from = val, names_from = col )
##> # A tibble: 5 × 5
##> col1 col2 col3 col4 col5
##> <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1 3 1 3 3
##> 2 s2 NA 3 4 5
##> 3 s3 NA NA 5 3
##> 4 s4 NA NA NA 1
##> 5 s6 4 5 NA 1
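The arithmetic in the pipeline above extracts the leading digit without any string handling: trunc(log10(x)) gives the number of digits minus one, and integer division by that power of ten leaves the first digit. A minimal sketch on toy values follows; the names x, magnitude, first_digit and via_string are made up for illustration, and the approach assumes positive integers.

```r
# Toy inputs: a six-digit, a two-digit and a one-digit number, plus NA
x <- c(414234L, 57L, 8L, NA)

# Power of ten just below each value, e.g. 414234 -> 10^5
magnitude <- 10^trunc(log10(x))

# Integer division keeps only the leading digit; NA propagates
first_digit <- x %/% magnitude

# Cross-check against the string-based approach used in the other answers
via_string <- as.integer(substr(as.character(x), 1, 1))
```

Both routes should agree on positive integers; the arithmetic version is not meaningful for zero or negative values, where log10 does not return a finite number.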