I have a character vector like this:
sent <- c("The quick brown fox jumps over the lazy dog.",
"Over the lazy dog jumped the quick brown fox.",
"The quick brown fox jumps over the lazy dog.")
and I use textcnt to generate bigrams as follows:
txt <- textcnt(sent, method = "string", split = " ", n=2, tolower = FALSE)
format(txt)
which gives me all the bigrams:
frq rank bytes Encoding
Over the 1 4.5 8 unknown
The quick 2 11.5 9 unknown
brown fox 2 11.5 9 unknown
brown fox. 1 4.5 10 unknown
dog jumped 1 4.5 10 unknown
dog. Over 1 4.5 9 unknown
fox jumps 2 11.5 9 unknown
fox. The 1 4.5 8 unknown
jumped the 1 4.5 10 unknown
jumps over 2 11.5 10 unknown
lazy dog 1 4.5 8 unknown
lazy dog. 2 11.5 9 unknown
over the 2 11.5 8 unknown
quick brown 3 15.5 11 unknown
the lazy 3 15.5 8 unknown
the quick 1 4.5 9 unknown
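The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the generated n-grams?
2. Is it possible to prevent the generation of n-grams that span two sentences, such as "dog. Over" and "fox. The"?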
You can avoid particular n-grams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a complete quanteda solution.
require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’
This tokenizes the text into bigrams while removing punctuation. Because each sentence is a "document", a bigram can never span two documents.
(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
##
## Component 2 :
## [1] "Over the" "the lazy" "lazy dog" "dog jumped" "jumped the" "the quick" "quick brown" "brown fox"
##
## Component 3 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
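For readers who want to see the idea without depending on a particular quanteda release, here is a minimal base R sketch of the same two steps: strip punctuation, then build bigrams one sentence at a time so that no bigram can cross a sentence boundary. The helper name sentence_bigrams is hypothetical (it is not part of quanteda or tau), and the sketch assumes each element of the input vector is exactly one sentence.

```r
# Hypothetical helper, not part of any package: per-sentence bigrams
# with punctuation removed before splitting into words.
sentence_bigrams <- function(x) {
  lapply(x, function(s) {
    s <- gsub("[[:punct:]]+", "", s)         # drop punctuation, e.g. the final "."
    w <- strsplit(trimws(s), "\\s+")[[1]]     # split on runs of whitespace
    if (length(w) < 2) return(character(0))   # a bigram needs at least two words
    paste(w[-length(w)], w[-1])               # pair each word with its successor
  })
}

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

bigrams <- sentence_bigrams(sent)
bigrams[[2]]
```

Note that removing every [[:punct:]] character also removes apostrophes and hyphens inside words ("don't" becomes "dont"), which may or may not be what you want on real data.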
To get the frequencies, tabulate the bigram tokens by building a document-feature matrix with dfm(). (Note: you can skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)
(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
## features
## docs The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
## text1 1 1 1 1 1 1 1 1 0 0
## text2 0 1 1 0 0 0 1 1 1 1
## text3 1 1 1 1 1 1 1 1 0 0
## features
## docs jumped the the quick
## text1 0 0
## text2 1 1
## text3 0 0
topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown brown fox the lazy lazy dog The quick fox jumps jumps over over the Over the
## 3 3 3 3 2 2 2 2 1
## dog jumped jumped the the quick
## 1 1 1
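The tabulation step can also be sketched in base R, which makes it easier to see what the document-feature matrix contains. This is only an illustration under the assumption that each element of sent is one sentence; the object names (bigrams, doc, counts, totals) and the "text1", "text2", ... labels are chosen here for the example and are not produced by quanteda.

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Per-sentence bigrams with punctuation removed (inline, for illustration only).
words   <- strsplit(gsub("[[:punct:]]+", "", sent), "\\s+")
bigrams <- lapply(words, function(w) paste(w[-length(w)], w[-1]))

# Label every bigram with the sentence it came from, then cross-tabulate:
# rows are documents (sentences), columns are distinct bigrams.
doc    <- rep(paste0("text", seq_along(bigrams)), lengths(bigrams))
counts <- table(docs = doc, features = unlist(bigrams))

# Overall frequency of each bigram, most frequent first.
totals <- sort(colSums(counts), decreasing = TRUE)
totals
```

Because the matrix keeps one row per sentence, you can also look up where a bigram occurs, for example counts["text2", "dog jumped"].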