I have a column in a data frame (the second column) where each cell contains several phrases separated by ";".
my_dataframe <- data.frame(first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
> my_dataframe
first_column second_column
1 x important; very important; not important
2 y not important; important; very important
3 x very important; important
4 x important; not important
5 y not important
I want to keep only one phrase per cell: the most important one.
So I made a list of the phrases in order of priority:
reference_importance <- list("very important", "important", "not important")
The second column I want to end up with:
second_column
1 very important
2 very important
3 very important
4 important
5 not important
I tried:
for (i in 1:dim(my_dataframe)[1]) {
for (j in 1:length(reference_importance)) {
if (j %in% my_dataframe$second_column){
my_dataframe$second_column[i] <- paste(j)
break}
}
}
Then I thought the problem was that it does not take into account the separate phrases delimited by ";". So I tried this:
for (i in 1:dim(my_dataframe)[1]) {
value_as_list <- strsplit(my_dataframe$second_column[i], ";")
print(value_as_list)
for (j in reference_importance) {
if (j %in% value_as_list){
my_dataframe$second_column[i] == j
break}
}
}
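But neither of these changes my column…
(I made this example to keep things simple, but in reality I have a large table with many more phrases and possibilities. That is why I am trying to do this with a loop rather than assigning the possible answers by hand.)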
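The second attempt appears to fail for two separate reasons, which the following minimal sketch checks on a single cell. The variable names `cell`, `values` and `x` are introduced here for illustration only.

```r
cell <- "important; very important; not important"

# strsplit() returns a list holding one character vector, not the vector itself
value_as_list <- strsplit(cell, ";")

# %in% compares against the list element as a whole, so a single phrase is not found
"important" %in% value_as_list

# extracting the vector with [[1]] and trimming the spaces left after ";" makes the test work
values <- trimws(value_as_list[[1]])
"important" %in% values

# `==` only compares two values; `<-` is what assigns
x <- "old"
x == "new"   # a comparison, x is unchanged
x <- "new"   # an assignment, x is now "new"
```

So the loop needs both `[[1]]` (plus whitespace handling) for the membership test and `<-` instead of `==` for the update.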
In base R, using strsplit and match:
my_dataframe <- transform(my_dataframe, z=strsplit(second_column, '; ') |>
lapply(match, reference_importance) |>
sapply(min) |>
(\(x) unlist(reference_importance)[x])())
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Note: requires R >= 4.1 (native pipe and lambda shorthand).
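To see what each stage of the strsplit/match approach produces, here is a small sketch that walks through a single cell. The names `cell`, `parts`, `positions` and `best` are illustrative and not part of the original answer.

```r
reference_importance <- list("very important", "important", "not important")
cell <- "important; very important; not important"

# strsplit() returns a list with one character vector per input string
parts <- strsplit(cell, "; ")[[1]]

# match() gives the position of each phrase in the priority list
positions <- match(parts, reference_importance)

# the smallest position corresponds to the highest priority
best <- reference_importance[[min(positions)]]
best
```

Because the reference list is ordered from most to least important, taking the minimum position is enough to pick the winner.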
If you need a loop, you can use:
spl <- strsplit(my_dataframe$second_column, '; ')
my_dataframe$z <- NA_character_
for (i in seq_along(spl)) {
my_dataframe$z[i] <- reference_importance[[min(match(spl[[i]], reference_importance))]]
}
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Of course, I used z for demonstration; in practice you should assign to second_column instead of z.
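As a sketch of overwriting second_column directly, the same idea can be wrapped in a small helper. `pick_top` is a hypothetical name, the reference is assumed to be a character vector here, and returning NA for cells with no known phrase is an assumption about the desired behaviour rather than something the question specifies.

```r
my_dataframe <- data.frame(
  first_column = c("x", "y", "x", "x", "y"),
  second_column = c("important; very important; not important",
                    "not important; important; very important",
                    "very important; important",
                    "important; not important",
                    "not important"))
reference_importance <- c("very important", "important", "not important")

# hypothetical helper: return the highest-priority phrase found in one cell
pick_top <- function(cell, ref) {
  # split on ";" and drop the surrounding whitespace
  idx <- match(trimws(strsplit(cell, ";")[[1]]), ref)
  idx <- idx[!is.na(idx)]            # ignore phrases missing from ref
  if (length(idx) == 0) NA_character_ else ref[min(idx)]
}

# apply the helper to every cell and overwrite the column
my_dataframe$second_column <- vapply(my_dataframe$second_column, pick_top,
                                     character(1), ref = reference_importance,
                                     USE.NAMES = FALSE)
my_dataframe
```

Dropping unmatched phrases before taking the minimum avoids an NA result when a cell mixes known and unknown phrases.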
If you want to use a loop, the following worked well for me:
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
reference_importance <- list("very important", "important", "not important")
library(dplyr) # provides %>% and mutate()
# add new column for priority word
my_dataframe <- my_dataframe %>%
  mutate(Priority_importance = NA_character_)
# use a loop to identify highest priority substring
for (i in 1:nrow(my_dataframe)) {
  # split the cell on ";" and trim the spaces around each phrase
  values <- trimws(strsplit(my_dataframe$second_column[i], ";")[[1]])
  for (j in 1:length(reference_importance)) {
    if (reference_importance[[j]] %in% values) {
      my_dataframe$Priority_importance[i] <- reference_importance[[j]] # store importance level
      break # stop at the first (highest priority) match
    }
  }
}
my_dataframe
first_column second_column Priority_importance
1 x important; very important; not important very important
2 y not important; important; very important very important
3 x very important; important very important
4 x important; not important important
5 y not important not important
An option with dplyr and tidyr:
library(dplyr)
library(tidyr)
library(tibble) # for rowid_to_column()
my_dataframe %>%
rowid_to_column() %>%
separate_rows(second_column, sep = "; ") %>%
group_by(rowid) %>%
slice_min(match(second_column, reference_importance))
rowid first_column second_column
<int> <chr> <chr>
1 1 x very important
2 2 y very important
3 3 x very important
4 4 x important
5 5 y not important
I used reference_importance as a character vector rather than a list:
reference_importance <- c("very important", "important", "not important")
Another possible solution, based on the tidyverse:
library(tidyverse)
my_dataframe %>%
mutate(id = row_number()) %>%
separate_rows(second_column, sep = "\\s*;\\s*") %>%
group_by(id) %>%
slice(match(reference_importance, second_column) %>% na.omit() %>% .[1]) %>%
ungroup %>%
select(-id)
#> # A tibble: 5 × 2
#> first_column second_column
#> <chr> <chr>
#> 1 x very important
#> 2 y very important
#> 3 x very important
#> 4 x important
#> 5 y not important