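I am working with the IMDB movie ratings dataset and struggling with the data preprocessing. Some movie-related words appear in many ratings but are uninformative as a unigram, e.g. "movie". However, if a rating says "good movie" or "bad movie", that is informative and I would like to keep it. Unfortunately, I have not been able to get my code to do this: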
library(tidyverse)
library(tidymodels)
library(textrecipes)
movie_stopwords <- tibble(word = c("movie","movies","movie's","act","acts","actor","actors",
"actress","actresses","actor's","actress´s",
"film","film's","director","directors","director's",
"character", "characters", "character's"))
my_corpus <- tibble(sentiment = c("positive","negative","positive"),
rating = c("this is a good movie","this movie sucks", "this movie has a good plot"))
# print the final unigrams, bigrams and trigrams
recipe(sentiment ~ rating, data = my_corpus) %>%
step_tokenize(rating) %>%
step_stopwords(rating, stopword_source = "marimo") %>%
step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
step_stopwords(rating, custom_stopword_source = movie_stopwords) %>%
step_untokenize(rating) %>%
prep() %>% bake(new_data = NULL)
This outputs the following tibble:
# OUTPUT AS IS
# A tibble: 3 x 2
rating sentiment
<fct> <fct>
1 good movie good_movie positive
2 movie sucks movie_sucks negative
3 movie good plot movie_good good_plot movie_good_plot positive
However, I would like the unigram "movie" to be removed, and I was really hoping that the second step_stopwords would do that.
Does anyone have an idea how to do that efficiently (i.e. for 50k ratings)?
# OUTPUT AS I WANT IT TO BE
# A tibble: 3 x 2
rating sentiment
<fct> <fct>
1 good good_movie positive
2 sucks movie_sucks negative
3 good plot movie_good good_plot movie_good_plot positive
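custom_stopword_source should be a character vector, not a data.frame/tibble. According to ?step_stopwords:
custom_stopword_source - A character vector to indicate a custom list of words that cater to the users specific problem.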
library(tidymodels)
library(magrittr)
library(textrecipes)
recipe(sentiment ~ rating, data = my_corpus) %>%
step_tokenize(rating) %>%
step_stopwords(rating, stopword_source = "marimo") %>%
step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
step_stopwords(rating, custom_stopword_source = movie_stopwords$word) %>%
step_untokenize(rating) %>%
prep() %>%
bake(new_data = NULL)
-output
# A tibble: 3 x 2
# rating sentiment
# <fct> <fct>
#1 good good_movie positive
#2 sucks movie_sucks negative
#3 good plot movie_good good_plot movie_good_plot positive
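As a small sanity check before building the recipe, the sketch below shows the difference between the tibble and the character vector stored in its word column. The shortened word list and the name stop_vec are only illustrative; this is not part of textrecipes itself.

```r
library(tibble)

# Shortened, illustrative version of the stop word table from the question
movie_stopwords <- tibble(word = c("movie", "movies", "film"))

# A tibble is a list of columns, not a character vector
is.character(movie_stopwords)   # FALSE

# Extract the column as a plain character vector (stop_vec is a hypothetical name)
stop_vec <- movie_stopwords[["word"]]   # same as movie_stopwords$word

# Fail early if the object is not a clean character vector
stopifnot(is.character(stop_vec), !anyNA(stop_vec))
```

stop_vec can then be passed as custom_stopword_source in place of movie_stopwords$word.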