How to remove (custom) stopwords from unigrams but keep them in bigrams



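I am working with the IMDB movie ratings dataset and struggling with the data preprocessing. Some movie-related words appear in many ratings but carry no information as a unigram, e.g. "movie". However, if a rating says "good movie" or "bad movie", that is informative and I want to keep it. Unfortunately, I have not been able to get my code to do this: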

library(tidyverse)
library(tidymodels)
library(textrecipes)

movie_stopwords <- tibble(word = c("movie", "movies", "movie's", "act", "acts", "actor", "actors",
                                   "actress", "actresses", "actor's", "actress´s",
                                   "film", "film's", "director", "directors", "director's",
                                   "character", "characters", "character's"))

my_corpus <- tibble(sentiment = c("positive", "negative", "positive"),
                    rating = c("this is a good movie", "this movie sucks", "this movie has a good plot"))

# print the final unigrams, bigrams and trigrams
recipe(sentiment ~ rating, data = my_corpus) %>%
  step_tokenize(rating) %>%
  step_stopwords(rating, stopword_source = "marimo") %>%
  step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
  step_stopwords(rating, custom_stopword_source = movie_stopwords) %>%
  step_untokenize(rating) %>%
  prep() %>%
  bake(new_data = NULL)

This outputs the following tibble:

# OUTPUT AS IS
# A tibble: 3 x 2
rating                                               sentiment
<fct>                                                <fct>    
1 good movie good_movie                                positive 
2 movie sucks movie_sucks                              negative 
3 movie good plot movie_good good_plot movie_good_plot positive 

However, I would like the unigram "movie" to be removed, and I was really hoping the second step_stopwords would do that.

Does anyone have an idea how to do that efficiently (i.e. for 50k ratings)?
# OUTPUT AS I WANT IT TO BE
# A tibble: 3 x 2
rating                                               sentiment
<fct>                                                <fct>    
1 good good_movie                                positive 
2 sucks movie_sucks                              negative 
3 good plot movie_good good_plot movie_good_plot positive

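custom_stopword_source should be a character vector, not a data.frame/tibble.

According to ?step_stopwords:

custom_stopword_source: a character vector to indicate a custom list of words that cater to the user's specific problem.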

library(tidymodels)
library(magrittr)
library(textrecipes)

# my_corpus and movie_stopwords are the objects defined in the question
recipe(sentiment ~ rating, data = my_corpus) %>%
  step_tokenize(rating) %>%
  step_stopwords(rating, stopword_source = "marimo") %>%
  step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
  # pass the word column (a character vector), not the whole tibble
  step_stopwords(rating, custom_stopword_source = movie_stopwords$word) %>%
  step_untokenize(rating) %>%
  prep() %>%
  bake(new_data = NULL)

Output:

# A tibble: 3 x 2
#  rating                                         sentiment
#  <fct>                                          <fct>    
#1 good good_movie                                positive 
#2 sucks movie_sucks                              negative 
#3 good plot movie_good good_plot movie_good_plot positive 
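
The output above suggests why the bigrams survive the second step_stopwords(): after step_ngram(), "good_movie" is a single token, so it is not equal to the stopword "movie". The following is a minimal base-R sketch of that idea, assuming the filtering compares whole tokens for exact equality. It is not the textrecipes implementation, and the names `tokens`, `stopwords` and `kept` are made up for illustration.

```r
# Base-R illustration of exact-match stopword filtering.
# This is a sketch of the idea, not the textrecipes internals.

# Tokens for "good movie" once unigrams and the bigram are present
tokens <- c("good", "movie", "good_movie")

# Shortened, illustrative stopword list (a character vector, not a tibble)
stopwords <- c("movie", "movies", "film")

# Keep only tokens that are not exactly equal to a stopword
kept <- tokens[!tokens %in% stopwords]
print(kept)
```

Only the bare unigram "movie" is dropped here, while "good_movie" stays because it matches no entry in the list.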
