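For each cell in a particular column of a data frame (here simply named df), I want to find the maximum and minimum of the numbers that are stored as text and embedded in a longer string. Commas in the cell have no special meaning. Percentages must not be counted, so if 50% appears, the 50 is excluded from consideration. The relevant column of the data frame looks like this: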
| particular_col_name |
| ------------------- |
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15 |
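So two new columns titled "maximum_number" and "minimum number" should be created. For the first row they should be 20 and 5 respectively. Note that 70 is excluded because it is followed by a % sign. Likewise, the second row should put 40 and 4 into the new columns.
I have tried several approaches inside dplyr's "mutate" (for example str_extract_all, regmatches, strsplit), but they either give error messages (particularly about the input column particular_col_name) or do not return the data in a format from which the maximum and minimum can easily be identified.
Any help would be much appreciated.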
library(tidyverse)
tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    # Drop percentages first, then pull out every remaining run of digits
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
    }),
    # min()/max() return Inf/-Inf on an empty vector; turn those into NA
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                   min   max
#>   <chr>                                                               <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA
Created on 2022-02-22 by the reprex package (v2.0.0)
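The two warnings in the output above come from calling min() and max() on an empty vector, which is what the row containing only percentages produces. Below is a minimal sketch of a variant that checks for that case first and returns NA directly. `safe_range` is a hypothetical helper name, not part of the answer, and the sketch assumes only stringr is loaded.

```r
library(stringr)

# Hypothetical helper: min and max of the non-percentage numbers in one string.
# Returns NA for both when no number is left, so min()/max() never see an
# empty vector and no warning is raised.
safe_range <- function(x) {
  found <- str_extract_all(str_remove_all(x, "[0-9]+%"), "[0-9]+")[[1]]
  nums <- as.numeric(found)
  if (length(nums) == 0) {
    return(c(min = NA_real_, max = NA_real_))
  }
  c(min = min(nums), max = max(nums))
}

strings <- c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15",
  "20% 30%"
)

# One row per input string, with columns "min" and "max"
result <- t(sapply(strings, safe_range, USE.NAMES = FALSE))
result
```

The explicit length check replaces the two `na_if()` calls, so the intent (no numbers means NA) is stated in one place.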
We can use str_extract_all together with sapply:
library(stringr)
# Remove percentages such as "70%" first so that they are not counted
nums <- str_extract_all(str_remove_all(df$particular_col_name, "[0-9]+%"), "[0-9]+")
df$min <- sapply(nums, function(x) min(as.integer(x)))
df$max <- sapply(nums, function(x) max(as.integer(x)))
  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    20
2 Second_Row_50%, number40. Number 4. number_15                                  4    40
Data:
df <- structure(list(particular_col_name = c("First Row String10. This is also a string_5, and so is this 20, exclude70%",
"Second_Row_50%, number40. Number 4. number_15"), min = 5:4,
max = c(70L, 50L)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
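Since the question also mentions regmatches, here is a hedged base R sketch that skips percentages with a single pattern instead of removing them in a separate step. It assumes a negative lookahead is acceptable (`perl = TRUE`). The column names `minimum_number` and `maximum_number` are taken loosely from the question, and every row is assumed to contain at least one usable number.

```r
# Same two strings as in the question's example.
x <- c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15"
)

# "[0-9]+(?![0-9%])": a run of digits that is NOT followed by another digit
# or by "%". Including the digit class in the lookahead stops the engine from
# backtracking and matching just the "7" of "70%".
matches <- regmatches(x, gregexpr("[0-9]+(?![0-9%])", x, perl = TRUE))
nums <- lapply(matches, as.integer)

res <- data.frame(
  particular_col_name = x,
  minimum_number = vapply(nums, min, integer(1)),
  maximum_number = vapply(nums, max, integer(1))
)
res[, c("minimum_number", "maximum_number")]
```

A plain `[0-9]+(?!%)` would not be enough here, because the regex engine could still match the leading digits of a percentage.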