R - How do I find and replace values in a data frame according to a priority list of words (with a for loop and conditions)?



I have a column in a data frame where each cell contains several words separated by ";" (the second column).
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
> my_dataframe
first_column                            second_column
1            x important; very important; not important
2            y not important; important; very important
3            x                very important; important
4            x                 important; not important
5            y                            not important

I want to keep only one word per cell: the most important one.

So I made a list of the words in order of priority:

reference_importance <- list("very important", "important", "not important")

The second column I want to get:

second_column
1 very important
2 very important
3 very important
4 important
5 not important

I tried:

for (i in 1:dim(my_dataframe)[1]) {
for (j in 1:length(reference_importance)) {
if (j %in% my_dataframe$second_column){
my_dataframe$second_column[i] <- paste(j)
break}
}
}

Then I thought the problem was that it did not take into account the separate words delimited by ";". So I tried this:

for (i in 1:dim(my_dataframe)[1]) {
value_as_list <- strsplit(my_dataframe$second_column[i], ";")
print(value_as_list)
for (j in reference_importance) {
if (j %in% value_as_list){
my_dataframe$second_column[i] == j
break}
}
} 

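But neither of these changes my column…

(I made this example for simplicity, but in reality I have a large table with many more words and possibilities. That is why I am trying to do this with a loop rather than assigning the possible answers by hand.)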

In base R, using strsplit and match:

my_dataframe <- transform(my_dataframe, z=strsplit(second_column, '; ') |>
lapply(match, reference_importance) |>
sapply(min) |>
{\(x) unlist(reference_importance)[x]}())
my_dataframe
#   first_column                            second_column              z
# 1            x important; very important; not important very important
# 2            y not important; important; very important very important
# 3            x                very important; important very important
# 4            x                 important; not important      important
# 5            y                            not important  not important

Note: this requires R >= 4.1 (for the native pipe |> and the shorthand anonymous-function syntax).
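For R versions older than 4.1, the same strsplit/match idea can be written without the native pipe. This is only a minimal sketch, assuming the example data from the question; ref and parts are names introduced here for illustration, and stringsAsFactors = FALSE is added because older R versions convert strings to factors by default.

```r
# Same example data as in the question
my_dataframe <- data.frame(
  first_column = c("x", "y", "x", "x", "y"),
  second_column = c("important; very important; not important",
                    "not important; important; very important",
                    "very important; important",
                    "important; not important",
                    "not important"),
  stringsAsFactors = FALSE)
reference_importance <- list("very important", "important", "not important")

# Flatten the priority list to a character vector (position = rank)
ref <- unlist(reference_importance)

# Split each cell on ";" together with any whitespace around it
parts <- strsplit(my_dataframe$second_column, "\\s*;\\s*")

# For each row, keep the word with the smallest rank in ref
my_dataframe$z <- vapply(parts, function(p) ref[min(match(p, ref))], character(1))
my_dataframe
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one value per row.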

If you need a loop, you can use:

spl <- strsplit(my_dataframe$second_column, '; ')
my_dataframe$z <- NA_character_
for (i in seq_along(spl)) {
my_dataframe$z[i] <- reference_importance[[min(match(spl[[i]], reference_importance))]]
}
my_dataframe
#   first_column                            second_column              z
# 1            x important; very important; not important very important
# 2            y not important; important; very important very important
# 3            x                very important; important very important
# 4            x                 important; not important      important
# 5            y                            not important  not important

Of course, I used z for demonstration; in practice you should assign to second_column instead of z.
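One caveat for a larger table: a word that is missing from the priority list makes match() return NA, and that NA propagates through min(). The following is a hedged sketch of one way to skip such words; pick_top is a hypothetical helper (not from any package), and the cells vector is made-up illustration data.

```r
reference_importance <- list("very important", "important", "not important")
ref <- unlist(reference_importance)

# pick_top() is a hypothetical helper: it returns the highest-priority word
# found in one cell, or NA if none of the words is in the priority list.
pick_top <- function(cell, ref) {
  words <- trimws(strsplit(cell, ";", fixed = TRUE)[[1]])
  idx <- match(words, ref)
  idx <- idx[!is.na(idx)]  # ignore words missing from the priority list
  if (length(idx) == 0) NA_character_ else ref[min(idx)]
}

# Made-up cells, including words that are not in the priority list
cells <- c("important; unknown word", "unknown word", "not important; important")
result <- vapply(cells, pick_top, character(1), ref = ref, USE.NAMES = FALSE)
result
```

Applied to the data frame, the same call would be vapply(my_dataframe$second_column, pick_top, character(1), ref = ref, USE.NAMES = FALSE).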

If you want to use a loop, the following approach worked well for me:

my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
reference_importance <- list("very important", "important", "not important")

library(dplyr) # provides %>% and mutate()

# add a new (character) column for the priority word
my_dataframe <- my_dataframe %>%
  mutate(Priority_importance = NA_character_)
# use a loop to identify the highest-priority word in each row
for (i in seq_len(nrow(my_dataframe))) {
  # split the cell on ";" and trim the whitespace around each word
  values <- trimws(strsplit(my_dataframe$second_column[i], ";")[[1]])

  for (j in seq_along(reference_importance)) {
    if (reference_importance[[j]] %in% values) {
      my_dataframe$Priority_importance[i] <- reference_importance[[j]] # store the importance level
      break # stop at the first (highest-priority) match and go to the next row
    }
  }
}
my_dataframe
first_column                            second_column Priority_importance
1            x important; very important; not important      very important
2            y not important; important; very important      very important
3            x                very important; important      very important
4            x                 important; not important           important
5            y                            not important       not important

One option with dplyr and tidyr:

library(dplyr)
library(tidyr)
library(tibble) # for rowid_to_column()

my_dataframe %>%
  rowid_to_column() %>%
  separate_rows(second_column, sep = "; ") %>%
  group_by(rowid) %>%
  slice_min(match(second_column, reference_importance))
rowid first_column second_column 
<int> <chr>        <chr>         
1     1 x            very important
2     2 y            very important
3     3 x            very important
4     4 x            important     
5     5 y            not important 

I used reference_importance as a character vector rather than a list:

reference_importance <- c("very important", "important", "not important")

Another possible solution, based on the tidyverse:

library(tidyverse)
my_dataframe %>% 
mutate(id = row_number()) %>% 
separate_rows(second_column, sep = "\\s*;\\s*") %>% 
group_by(id) %>% 
slice(match(reference_importance, second_column) %>% na.omit() %>% .[1]) %>% 
ungroup %>% 
select(-id)
#> # A tibble: 5 × 2
#>   first_column second_column 
#>   <chr>        <chr>         
#> 1 x            very important
#> 2 y            very important
#> 3 x            very important
#> 4 x            important     
#> 5 y            not important
