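I am working with a large collection of political speeches and would like to create two subsets. The first should contain the documents that include one or more terms from a specific list of keywords (e.g. "migrant", "migration", "asylum"). The second should contain the documents that include none of these terms (that is, the speeches that are not in the first subset).
Any input on this would be greatly appreciated. Thanks!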
#first suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")
Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall, :
Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern), :
longer object length is not a multiple of shorter object length
> table(corp_labcon$criteria)
no yes
556921 6139
#Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")
> table(corp_labcon$criteria)
no
563060
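You didn't give a reproducible example, but I will show how to do it using quanteda and the built-in corpus data_corpus_inaugural. You can make use of the docvars that can be attached to a corpus. This works just like adding a variable to a data.frame.
Use stringi::stri_detect_regex to check, for each document, whether any of the search words occurs in the text. If so, set the value in the criteria column to "yes", otherwise to "no". After that you can use corpus_subset to create 2 new corpora based on the criteria value. See the example code below.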
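library(quanteda)
# Words used in the regex selection. Note: in a regex the trailing * is a
# quantifier on the preceding letter, not a glob wildcard. Because
# stri_detect_regex matches substrings, the plain stems would work as well.
regex_pattern <- c("migrant*", "migration*", "asylum*")
# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(
  stringi::stri_detect_regex(data_corpus_inaugural,
                             pattern = paste0(regex_pattern, collapse = "|")),
  "yes", "no")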
# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
  Year  President FirstName                 Party criteria
1 1789 Washington    George                  none      yes
2 1793 Washington    George                  none       no
3 1797      Adams      John            Federalist       no
4 1801  Jefferson    Thomas Democratic-Republican       no
5 1805  Jefferson    Thomas Democratic-Republican       no
6 1809    Madison     James Democratic-Republican       no
# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
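Regarding the two attempts shown in the question, a small sketch on made-up texts may help (the `texts` and `keywords` vectors below are hypothetical stand-ins, not your data). In the first attempt, `collapse = "|"` sits outside `paste0()`, so the pattern stays a vector of three strings and gets recycled over the documents, which is what the "longer object length" warning is about; and `ignore_case` is not a stringi option, the name is `case_insensitive`. In the second attempt, `glob2rx("migrant*")` returns `"^migrant"`, which only matches texts that start with that word, hence no "yes" at all.

```r
library(stringi)

# Toy stand-in for the speeches (hypothetical texts, not real data)
texts <- c(
  "Migration policy was debated at length.",
  "The budget was approved without changes.",
  "Asylum seekers need legal support.",
  "A new immigrant centre has opened."
)

keywords <- c("migrant", "migration", "asylum")

# collapse = "|" goes inside paste0(), so the pattern is ONE string
pattern <- paste0(keywords, collapse = "|")

# case_insensitive (not ignore_case) is the stringi option name
hit <- stri_detect_regex(texts, pattern, case_insensitive = TRUE)

# Substring matching also catches "immigrant"; wrapping the alternation
# in \\b(...)\\b restricts the match to whole words
pattern_whole <- paste0("\\b(", pattern, ")\\b")
hit_whole <- stri_detect_regex(texts, pattern_whole, case_insensitive = TRUE)

with_terms    <- texts[hit]    # texts containing at least one keyword
without_terms <- texts[!hit]   # texts containing none of them
```

Whether "immigrant" should count as a hit is a substantive choice for your analysis; the two patterns above just show both options.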
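I'm not sure how your data is organized, but you could try the grep() family of functions. Assuming the data is a data frame in which each row holds one text, you can try: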
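words <- c("migrant", "migration", "asylum")
# grepl() takes a single pattern and returns a logical vector that can be negated
has_word <- grepl(paste(words, collapse = "|"), df$text)
df[has_word, ]   # This will give you those lines with the words
df[!has_word, ]  # This will give you those lines without the words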
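As a self-contained illustration, here is a minimal sketch on a made-up data frame (the `df` below and its `id` and `text` columns are assumptions for the example). It uses grepl() rather than grep() because grepl() returns one TRUE/FALSE per row, so the complement is simply `!has_word`. Negating the integer positions returned by grep() with `!` does not select the non-matching rows.

```r
# Hypothetical data frame: one speech per row in a `text` column
df <- data.frame(
  id = 1:4,
  text = c("The migrant crisis dominated the debate.",
           "Taxes will not rise this year.",
           "We must reform the asylum system.",
           "Schools need more funding."),
  stringsAsFactors = FALSE
)

words <- c("migrant", "migration", "asylum")

# Join the words into one alternation, since grepl() expects one pattern
has_word <- grepl(paste(words, collapse = "|"), df$text)

subset1 <- df[has_word, ]    # rows mentioning at least one of the words
subset2 <- df[!has_word, ]   # all remaining rows

# Sanity check: the two subsets partition the original data
stopifnot(nrow(subset1) + nrow(subset2) == nrow(df))
```

If capitalised forms such as "Migration" should match as well, `ignore.case = TRUE` can be passed to grepl().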
Your data may not be structured like this, though. It would help if you explained in more detail what your data looks like.