r - How to keep punctuation and spaces out of ngrams



I have a character vector like this:

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

and I am using textcnt to generate bigrams as follows:

txt <- textcnt(sent, method = "string", split = " ", n=2, tolower = FALSE)

format(txt) gives me all the bigrams:

              frq rank  bytes Encoding
Over the      1   4.5   8     unknown
The quick     2   11.5  9     unknown
brown fox     2   11.5  9     unknown
brown fox.    1   4.5   10    unknown
dog jumped    1   4.5   10    unknown
dog. Over     1   4.5   9     unknown
fox jumps     2   11.5  9     unknown
fox. The      1   4.5   8     unknown
jumped the    1   4.5   10    unknown
jumps over    2   11.5  10    unknown
lazy dog      1   4.5   8     unknown
lazy dog.     2   11.5  9     unknown
over the      2   11.5  8     unknown
quick brown   3   15.5  11    unknown
the lazy      3   15.5  8     unknown
the quick     1   4.5   9     unknown  

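The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the generated ngrams?
2. Is it possible to prevent ngrams that span two sentences from being generated, such as "dog. Over" and "fox. The"?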

You can avoid specific ngrams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a full quanteda solution.

require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’

This tokenizes the text into bigrams while removing punctuation. Because each sentence is a "document", a bigram will never span documents.

(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
## 
## Component 2 :
## [1] "Over the"    "the lazy"    "lazy dog"    "dog jumped"  "jumped the"  "the quick"   "quick brown" "brown fox"  
## 
## Component 3 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
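For readers who want to see the idea without any package, here is a minimal base R sketch of the same two steps: strip punctuation, then build bigrams one sentence at a time. It is not how quanteda implements tokenization; make_bigrams() is a hypothetical helper introduced only for illustration, and the sketch assumes that words are separated by whitespace.

```r
# Example input from the question
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# make_bigrams() is a hypothetical helper, not part of tau or quanteda
make_bigrams <- function(x) {
  # Drop punctuation, trim the ends, then split on runs of whitespace
  cleaned <- trimws(gsub("[[:punct:]]", "", x))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  # A sentence with fewer than two words has no bigrams
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

# One list element per sentence, so no bigram can span two sentences
bigrams <- lapply(sent, make_bigrams)
bigrams[[2]]
```

Because the pairing happens inside each list element, pairs such as "dog. Over" cannot be produced, which is the same effect as treating each sentence as its own document.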

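To get the frequencies, you should tabulate the bigram tokens by constructing a document-feature matrix with dfm(). (Note: you could skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)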

(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##        features
## docs    The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
##   text1         1           1         1         1          1        1        1        1        0          0
##   text2         0           1         1         0          0        0        1        1        1          1
##   text3         1           1         1         1          1        1        1        1        0          0
## features
## docs    jumped the the quick
##   text1          0         0
##   text2          1         1
##   text3          0         0
topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown   brown fox    the lazy    lazy dog   The quick   fox jumps  jumps over    over the    Over the 
##           3           3           3           3           2           2           2           2           1 
##  dog jumped  jumped the   the quick 
##           1           1           1 
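The tabulation step can also be sketched in base R, which may help when checking the counts above by hand. This is only an illustration under the assumption that words are separated by single spaces; it uses table() rather than a document-feature matrix, so it yields overall counts but not the per-document breakdown that dfm() gives.

```r
# Example input from the question
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Remove punctuation, then split each sentence into words
# (assumes single spaces between words)
words <- strsplit(gsub("[[:punct:]]", "", sent), " ", fixed = TRUE)

# Build bigrams within each sentence only
bigrams <- lapply(words, function(w) paste(head(w, -1), tail(w, -1)))

# Pool the bigrams and count them, most frequent first
counts <- sort(table(unlist(bigrams)), decreasing = TRUE)
counts
```

Sorting the table in decreasing order plays the role that topfeatures() plays in the quanteda version.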
