Most common phrases in text data in R

Does anyone have experience identifying the most common phrases (3 to 7 consecutive words) in text data? To be clear, most frequency analyses focus on the most frequent/common single words (and plotting a WordCloud) rather than phrases.

# Assuming a particular column in a data frame (df) with n rows that is all text data,
# as providing sample data via dput() on a large text file isn't feasible here
library(tm) # Corpus() and VectorSource() come from tm
Text <- df$Text_Column
docs <- Corpus(VectorSource(Text))
...

Thanks in advance!

There are several options in R to do this. Let's first get some data. I used Jane Austen's books from janeaustenr and did a bit of cleaning so that each paragraph ends up in its own row:

library(janeaustenr)
library(tidyverse)

# collapse the line-based text into one row per paragraph
books <- austen_books() %>% 
  mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>% # paragraph id increments at blank-line boundaries
  group_by(paragraph) %>% 
  summarise(book = head(book, 1),
            text = trimws(paste(text, collapse = " ")),
            .groups = "drop")

With tidytext

library(tidytext)

# Using multiple values for n is not directly implemented in tidytext,
# so tokenise once per n-gram size and bind the results together
map_df(3L:7L, ~unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>% 
  slice_max(n, n = 10)
#> # A tibble: 10 × 2
#>    ngram               n
#>    <chr>           <int>
#>  1 i am sure         415
#>  2 i do not          412
#>  3 she could not     328
#>  4 it would be       258
#>  5 in the world      247
#>  6 as soon as        236
#>  7 a great deal      214
#>  8 would have been   211
#>  9 she had been      203
#> 10 it was a          202
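
Since the question mentions plotting, the tidytext counts above drop straight into ggplot2 (already attached via library(tidyverse)). This is my own addition rather than part of the original answer, and just a sketch of one option; a horizontal bar chart tends to handle multi-word phrases better than a word cloud:

map_df(3L:7L, ~unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>% 
  slice_max(n, n = 10) %>% 
  ggplot(aes(x = n, y = reorder(ngram, n))) + # order phrases by frequency
  geom_col() +
  labs(x = "count", y = NULL, title = "Most common 3-7 word phrases")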

With quanteda

library(quanteda)

books %>% 
  corpus(docid_field = "paragraph",
         text_field = "text") %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_ngrams(n = 3L:7L) %>% # build all 3- to 7-grams
  dfm() %>% 
  topfeatures(n = 10) %>% 
  enframe() # turn the named vector into a tibble
#> # A tibble: 10 × 2
#>    name            value
#>    <chr>           <dbl>
#>  1 i_am_sure         415
#>  2 i_do_not          412
#>  3 she_could_not     328
#>  4 it_would_be       258
#>  5 in_the_world      247
#>  6 as_soon_as        236
#>  7 a_great_deal      214
#>  8 would_have_been   211
#>  9 she_had_been      203
#> 10 it_was_a          202
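
As a side note (my addition, assuming the companion package quanteda.textstats is installed), textstat_frequency() gives the same ranking as a data frame that also includes document frequencies:

library(quanteda.textstats)
books %>% 
  corpus(docid_field = "paragraph",
         text_field = "text") %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_ngrams(n = 3L:7L) %>%
  dfm() %>% 
  textstat_frequency(n = 10) # columns: feature, frequency, rank, docfreq, group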

With text2vec

library(text2vec)
itoken(books$text, tolower, word_tokenizer) %>% 
  create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>% 
  filter(str_detect(term, "[[:alpha:]]")) %>% # keep terms with at least one alphabetic character
  slice_max(term_count, n = 10)
#> Number of docs: 10293 
#> 0 stopwords:  ... 
#> ngram_min = 3; ngram_max = 7 
#> Vocabulary: 
#>                term term_count doc_count
#>  1:       i am sure        415       384
#>  2:        i do not        412       363
#>  3:   she could not        328       288
#>  4:     it would be        258       233
#>  5:    in the world        247       234
#>  6:      as soon as        236       233
#>  7:    a great deal        214       209
#>  8: would have been        211       192
#>  9:    she had been        203       179
#> 10:        it was a        202       194
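
On a large corpus the 3- to 7-gram vocabulary can become very big. A possible tweak (my addition, not part of the original answer) is to prune rare n-grams with text2vec's prune_vocabulary() before ranking:

itoken(books$text, tolower, word_tokenizer) %>% 
  create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>% 
  prune_vocabulary(term_count_min = 5L) %>% # drop n-grams seen fewer than 5 times
  filter(str_detect(term, "[[:alpha:]]")) %>% 
  slice_max(term_count, n = 10)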

Created on 2022-08-03 by the reprex package (v2.0.1)
