
r - Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")

Updated: 2023-09-17

I am using quanteda (v. 2.0.9000) in R (v. 4.0.0) for text analysis.

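I use the corpus_reshape function to split texts into sentences, but I noticed that the function breaks documents not only at the end of sentences, but also after an abbreviation with a dot (such as "e.g.", "i.e.", or "U.S.") when it is followed by a capital letter or a number.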

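Is there a way to prevent these specific splits? Something that tells the function: "split the text, but not when the characters before the dot are 'e.g', 'i.e', or 'u.s'"?

Thanks in advance for your help!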

Without an example that reproduces the problem I cannot troubleshoot it, but on my system it works fine.

library("quanteda")
## Package version: 2.1.0
txt <- c(
d1 = "This is an example, e.g. something.  Whatever, i.e. something.",
d2 = "The U.S. is south of Canada."
)
corpus(txt) %>%
corpus_reshape(to = "sentences")
## Corpus consisting of 3 documents.
## d1.1 :
## "This is an example, e.g. something."
## 
## d1.2 :
## "Whatever, i.e. something."
## 
## d2.1 :
## "The U.S. is south of Canada."
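If a split after an abbreviation does show up in a particular setup, one possible workaround is to mask the dots of known abbreviations before splitting and restore them afterwards. The following is only a minimal base R sketch of that idea, not how corpus_reshape works internally. The helper name `split_sentences_protected`, the `<DOT>` placeholder, and the simple splitting rule are all assumptions made for illustration.

```r
# Hypothetical helper (not part of quanteda): split text into sentences
# without splitting after the listed abbreviations.
split_sentences_protected <- function(x,
                                      abbreviations = c("e.g.", "i.e.", "U.S.")) {
  placeholder <- "<DOT>"  # assumed never to occur in the input text
  for (abbr in abbreviations) {
    # Mask the dots inside the abbreviation so they cannot trigger a split
    masked <- gsub(".", placeholder, abbr, fixed = TRUE)
    x <- gsub(abbr, masked, x, fixed = TRUE)
  }
  # Naive rule: split on whitespace that follows ".", "!" or "?"
  parts <- strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  # Restore the masked dots
  gsub(placeholder, ".", parts, fixed = TRUE)
}

txt <- "This is an example, e.g. Something else. Whatever, i.e. 3 items. The U.S. Senate met."
sentences <- split_sentences_protected(txt)
print(sentences)
```

The resulting character vector can then be passed to corpus() if a sentence-level corpus is needed. Note the trade-off: a sentence that genuinely ends with one of the listed abbreviations will not be split at that point, and matching is case-sensitive.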
