r - Discard longer dictionary matches that contain a nested target word

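I am using tokens_lookup to check whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when a dictionary word is part of a fixed, ordered sequence of words. For example, suppose that Ireland is in the dictionary. I would like to exclude the cases where Northern Ireland is mentioned (or, likewise, any fixed set of words that contains Britain). The only indirect solution I have found is to build another dictionary with these sets of words (e.g. Great Britain). However, that solution does not work when both Britain and Great Britain are mentioned. Many thanks.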

library("quanteda")
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = dict)

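You can do this by specifying another dictionary key for "Northern Ireland", with the value also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), the longer phrase is matched first and only once, which separates "Ireland" from "Northern Ireland". By using a key that is identical to the value, you replace the phrase like for like (with the side benefit that the two tokens "Northern" and "Ireland" are now combined into a single token).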

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks,
  dictionary = dict, exclusive = FALSE,
  nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"    "lorem" "ipsum"
## 
## doc2 :
## [1] "Lorem"            "ipsum"            "Northern Ireland"
## 
## doc3 :
## [1] "IE"               "lorem"            "ipsum"            "Northern Ireland"

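Here I used exclusive = FALSE for illustration, so that you can see what was looked up and replaced. You can remove it and the capkeys argument when you run this yourself.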

If you want to discard the "Northern Ireland" tokens, just use

tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
  tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
## 
## doc2 :
## character(0)
## 
## doc3 :
## [1] "IE"
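
The same idea should extend to the case raised in the question, where both Britain and Great Britain are mentioned. The following is only a sketch under assumptions: the key names GB and EXCLUDE and the text of doc3 are hypothetical and not part of the original example. All the longer phrases are collected under one placeholder key, which is then removed in a single step.

```r
library("quanteda")

# Hypothetical keys: IE and GB are the targets; EXCLUDE collects the longer
# phrases that should not count as matches for the targets.
dict <- dictionary(list(
  IE = "Ireland",
  GB = "Britain",
  EXCLUDE = c("Northern Ireland", "Great Britain")
))

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Britain and Great Britain"
)
toks <- tokens(txt)

# nested_scope = "dictionary" resolves nested matches across all keys, so the
# longer phrase wins over the single word it contains.
looked_up <- tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary")

# Drop the placeholder key, leaving only the genuine target matches.
result <- tokens_remove(looked_up, "EXCLUDE")
print(result)
```

Using one shared key for all excluded phrases means the removal step does not need to change as you add more phrases; only the dictionary grows.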
