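I want to explicitly replace specific tokens in an object of class tokens from the quanteda package. I have not been able to replicate the standard approach that works with stringr.
The goal is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").
Please have a look at the minimal example below: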
suppressPackageStartupMessages(library(quanteda))
library(stringr)
text = "It was a beautiful day down to the coastof California."
# I would solve this with stringr as follows:
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."
# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )
# I want to replace "coastof" with "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "It" "was" "a" "beautiful" "day"
#> [6] "down" "to" "the" "\\1 \\2" "California"
#> [11] "."
Is there a workaround?
Created on 2021-03-16 by the reprex package (v1.0.0)
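You can combine the two approaches: build a list of the words that need separating together with their separated forms, then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that the pattern has not caught replacements you do not want to make.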
suppressPackageStartupMessages(library("quanteda"))
toks <- tokens("It was a beautiful day down to the coastof California.")
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# insert a space before "of", then split each key into its two parts
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")
keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"
tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
## [1] "It" "was" "a" "beautiful" "day"
## [6] "down" "to" "the" "coast" "of"
## [11] "California" "."
Created on 2021-03-16 by the reprex package (v1.0.0)
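The curation step described above can be wrapped in a small helper. This is only a sketch under some assumptions: split_suffix_tokens() and its keep argument are hypothetical names, not part of quanteda, and the pattern is anchored as "^(.+)of$" (stricter than the one used above) so that only tokens ending in the suffix are candidates. Real words such as "roof" or "proof" also end in "of", which is why the candidate keys are worth inspecting, and excluding where needed, before the replacement is applied.

```r
suppressPackageStartupMessages(library(quanteda))

# Hypothetical helper (not a quanteda function): split tokens that end in
# `suffix` into two tokens, the stem and the suffix itself.
# `keep` lists token types that must be left untouched (the curation step).
split_suffix_tokens <- function(toks, suffix = "of", keep = NULL) {
  pattern <- paste0("^(.+)", suffix, "$")
  # candidate keys: the unique token types that end in the suffix
  keys <- grep(pattern, types(toks), value = TRUE)
  # drop anything the caller wants to protect
  keys <- setdiff(keys, keep)
  if (length(keys) == 0) return(toks)
  # one two-element replacement per key, e.g. "coastof" -> c("coast", "of")
  vals <- lapply(keys, function(k) c(sub(pattern, "\\1", k), suffix))
  tokens_replace(toks, pattern = keys, replacement = vals, valuetype = "fixed")
}

toks <- tokens("It was a beautiful day down to the coastof California.")
split_toks <- split_suffix_tokens(toks)

# protecting a legitimate word that happens to end in "of"
roof_toks <- split_suffix_tokens(tokens("The roof of the house"), keep = "roof")
```

Matching the keys with valuetype = "fixed" keeps the replacement limited to exactly the token types that survived curation, rather than re-running the regular expression against the whole tokens object.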