r - Replacing quanteda tokens via regex



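I would like to explicitly replace specific tokens in an object of class tokens from the quanteda package. I have not been able to replicate the standard approach that works with stringr.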

The goal is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please have a look at the minimal example below:

suppressPackageStartupMessages(library(quanteda))
library(stringr)
text = "It was a beautiful day down to the coastof California."
# I would solve this with stringr as follows: 
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."
# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )
# I want to replace "coastof" with "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "It"         "was"        "a"          "beautiful"  "day"       
#>  [6] "down"       "to"         "the"        "\\1 \\2"    "California"
#> [11] "."

Is there a workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

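You can use a hybrid approach: build a list of the words that need to be split together with their split forms, then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that you have not picked up replacements you would not want to apply.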

suppressPackageStartupMessages(library("quanteda"))
toks <- tokens("It was a beautiful day down to the coastof California.")
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
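# insert a space before "of" (backreferences need doubled backslashes in R strings),
# then split on that space to get one character vector per key
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")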
keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"
tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
##  [1] "It"         "was"        "a"          "beautiful"  "day"       
##  [6] "down"       "to"         "the"        "coast"      "of"        
## [11] "California" "."
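
To illustrate the curation step, here is a minimal sketch in base R only. The vector token_types and the exception list keep_whole are made up for the example; in practice the candidates would come from your own tokens object. It also uses a stricter pattern than the one above, anchored at the end of the token, which is my assumption about what you would want, since the unanchored pattern also matches "of" itself and words such as "office".

```r
# Hypothetical token types, standing in for the unique tokens of a real corpus
token_types <- c("coastof", "of", "office", "proof", "California")

# Require at least one character before a trailing "of",
# so "of" itself and "office" are not selected
keys <- grep("^.+of$", token_types, value = TRUE)

# Curate: drop real words that merely end in "of" (hypothetical exception list)
keep_whole <- c("proof", "roof", "hoof")
keys <- setdiff(keys, keep_whole)

# Insert a space before the trailing "of", then split into two tokens per key
vals <- strsplit(sub("^(.+)(of)$", "\\1 \\2", keys), " ", fixed = TRUE)

keys
vals
```

The resulting keys and vals can be passed to tokens_replace() in the same way as above. Inspecting keys before the final step is the point where unwanted replacements can be weeded out.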

Created on 2021-03-16 by the reprex package (v1.0.0)
