Tokenization of Compound Words not Working in Quanteda

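I am trying to use the kwic() function to create a data frame containing specific keywords in context, but unfortunately I run into a problem when tokenizing the underlying dataset.

This is the subset of the dataset I am using, as a reproducible example: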

test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
               speechContent,
               ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")

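Here, test_cluster contains 6 observations of 12 variables, i.e. 6 rows, in which the column speechContent contains the compound word "Schwester Agnes". test_corpus converts the underlying data into a quanteda corpus object.

When I run the code below, I expect that, first, the content of the speechContent variable is tokenized and, because of tokens_compound(), the compound word "Schwester Agnes" is kept together as a single token. In a second step, I expect kwic() to return a data frame of six rows in which the keyword variable contains the compound word "Schwester Agnes". Instead, kwic() returns an empty data frame with 0 observations of 7 variables. I think this is because I made some mistake with tokens_compound(), but I am not sure. Any help would be much appreciated!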

test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))
test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)

Edit: I realize the example above is not easy to reproduce, so please refer to the reprex below:

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of",
           "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.",
           "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
                   speechContent = speech)
test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = c("stack", "overflow"))
test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)

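You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound(). By default tokens_compound() joins the tokens with "_", so the compound becomes "stack_overflow" and the kwic() pattern "stack overflow" does not match it.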

require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of",
            "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.",
            "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
                   speechContent = speech)
test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")
test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)
test_kwic
#> Keyword-in-context with 2 matches.                                                                             
#>  [1, 29] for example is the word | stack overflow | However there are so many
#>  [2, 24]     but at the very end | stack overflow |

Created on 2022-05-06 by the reprex package (v2.0.1)
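For comparison, here is a minimal sketch of two other ways to get the same matches, plus the call from the question that returns nothing. The shortened texts and the object names (toks, kwic_phrase, kwic_comp, kwic_bad) are made up for illustration and are not from the original post. The sketch assumes that a plain character vector such as c("stack", "overflow") is read as separate one-token patterns rather than as a sequence, which is why phrase() is needed for multi-word patterns.

```r
library(quanteda)

# Shortened, illustrative versions of the three speeches (hypothetical data)
speech <- c(
  "One relevant word, for example, is the word stack overflow. However there are more words.",
  "It still includes the word of interest, but at the very end. stack overflow.",
  "this is the third speech, and it does not include the word of interest."
)
toks <- tokens(speech, remove_punct = TRUE, remove_numbers = TRUE)

# Option A: skip compounding; phrase() makes kwic() match the two-token sequence
kwic_phrase <- kwic(toks, pattern = phrase("stack overflow"), window = 5)

# Option B: compound with the default concatenator "_" and search for the joined token
toks_comp <- tokens_compound(toks, pattern = phrase("stack overflow"))
kwic_comp <- kwic(toks_comp, pattern = "stack_overflow", window = 5)

# The call from the question: no token is ever "stack overflow" with a space,
# so the pattern "stack overflow" finds nothing
toks_bad <- tokens_compound(toks, pattern = c("stack", "overflow"))
kwic_bad <- kwic(toks_bad, pattern = "stack overflow", window = 5)

# Number of matches for each approach
nrow(as.data.frame(kwic_phrase))
nrow(as.data.frame(kwic_comp))
nrow(as.data.frame(kwic_bad))
```

Option A is probably the simplest if the compound is only needed for the keyword search. Option B (or the concatenator = " " variant above) is more useful when the joined token should also be treated as one feature in later steps such as a dfm.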
