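I'm trying to learn more about corpus and word analysis in R. I recently started using cleanNLP with the spaCy backend. The problem is that, after parsing a text, I want to check whether a sentence contains tokens tagged with particular dependency relations.
For example,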
library(cleanNLP)
library(tidyverse)
cnlp_init_spacy()  # initialise the spaCy backend before annotating
text <- cnlp_annotate(c("I gave money to him"))$token  # keep the token table
The result would be
  doc_id   sid   tid token token_with_ws lemma  upos  xpos  tid_source relation
   <int> <int> <int> <chr> <chr>         <chr>  <chr> <chr>      <int> <chr>
1      1     1     1 I     "I "          -PRON- PRON  PRP            2 nsubj
2      1     1     2 gave  "gave "       give   VERB  VBD            0 root
3      1     1     3 money "money "      money  NOUN  NN             2 dobj
4      1     1     4 to    "to "         to     ADP   IN             2 dative
5      1     1     5 him   "him"         -PRON- PRON  PRP            4 pobj
I filtered the data frame with
dative <- c("dative")
anno %>%
  filter(grepl(dative, relation)) %>%
  select(sid, sentence)
and looked up the preceding and following context:
anno %>%
  mutate(kwic = ifelse(grepl(dative, relation),
                       TRUE, FALSE)) %>%
  # "\\s" needs a doubled backslash inside an R string
  mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
         after = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
  ) %>%
  filter(kwic) %>%
  select(before, token, after)
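For readers who want to try the context lookup on something small, here is a minimal sketch. The `anno` table below is made up to mirror the example sentence, the window is two tokens instead of three, and `coalesce()` plus `trimws()` are my substitution for the `gsub()` call (which, as far as I can tell, would also strip "NA" from real tokens such as "NATO"). Grouping by `sid` is an assumption that the context should not cross sentence boundaries.

```r
library(dplyr)

# Made-up token table mirroring the example sentence (not real parser output)
anno <- tibble(
  sid = 1,
  token = c("I", "gave", "money", "to", "him"),
  relation = c("nsubj", "root", "dobj", "dative", "pobj")
)

kwic <- anno %>%
  group_by(sid) %>%   # keep the context window inside one sentence
  mutate(
    # coalesce() swaps missing neighbours for "" so no literal "NA" is pasted
    before = trimws(paste(coalesce(lag(token, 2), ""), coalesce(lag(token), ""))),
    after  = trimws(paste(coalesce(lead(token), ""), coalesce(lead(token, 2), "")))
  ) %>%
  ungroup() %>%
  filter(relation == "dative") %>%
  select(before, token, after)

print(kwic)
```

This only shows the context around the dative token; it does not yet check which relations that context carries.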
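I want to extract from the corpus all sentences that carry all three relation tags (dobj, dative, pobj). In other words, I want to check the preceding and following context and extract the sentence if that context carries the tags "dobj" and "pobj".
So basically, I want to extract sentences with the dobj, dative, pobj pattern (sentences with two objects, such as "I gave money to him"), but not sentences that have only one or two of these, say dobj alone ("I gave money") or preposition + pobj ("I gave to him").
How can I do this? Any help is appreciated.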
So far, with a lot of help from @GeoffreyPoole, I have finally got the list. After some edits to the code below, the output is:
library(zoo)
library(dplyr)

target <- "root dobj dative pobj"
text %>%
  select(sid, relation, lemma) %>%
  # get rid of any sentences with fewer than four tokens...
  group_by(sid) %>%
  summarize(n = n()) %>%
  filter(n >= 4) %>%
  left_join(text) %>%
  # make sure tokens are in order...
  arrange(sid, tid, lemma) %>%
  # now, for each sentence...
  group_by(sid) %>%
  group_modify(
    function(x, y) {
      # paste together each run of four relations (and tokens) and convert to a data frame
      rollapply(x[, c("relation", "token")], 4, paste, collapse = " ") %>%
        as.data.frame
    }
  ) %>%
  # get all unique combinations of sid and pasted runs
  distinct %>%
  # select records with the desired pasted run
  filter(relation == target) %>%
  # join back to text; the pasted strings match no single row of text,
  # so the columns that come only from text are NA
  left_join(text)
    sid relation              token                             doc_id   tid token_with_ws lemma upos  xpos  tid_source
  <int> <chr>                 <chr>                              <int> <int> <chr>         <chr> <chr> <chr>      <int>
1   949 root dobj dative pobj gives ideas to people                 NA    NA NA            NA    NA    NA            NA
2  1242 root dobj dative pobj provided advantages for customers     NA    NA NA            NA    NA    NA            NA
3  1631 root dobj dative pobj give harm to themselves               NA    NA NA            NA    NA    NA            NA
4  2275 root dobj dative pobj say this to us                        NA    NA NA            NA    NA    NA            NA
5  3016 root dobj dative pobj write fine to you                     NA    NA NA            NA    NA    NA            NA
6  3826 root dobj dative pobj cause problem for society             NA    NA NA            NA    NA    NA            NA
7  4184 root dobj dative pobj gives harm to women                   NA    NA NA            NA    NA    NA            NA
Only one question remains: do I need to edit target to catch further relations? For example, with the target used above, the result includes
1242 root dobj dative pobj provided advantages for customers
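What happens if the actual sentence is "provided advantages for [det] customers", with a determiner before the noun? Would target need to be rewritten as "root dobj dative (det) pobj" to catch such patterns?
Thanks.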
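The revised question from @Fatih made me realize that there is a much more powerful (and more efficient) answer to this question than the one I originally posted.
The key is to build "sentences" out of the parts of speech rather than out of the tokens (words) themselves, and then to use a regex (e.g. with grepl()) to find the "sentences" that have the desired pattern.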
Here is some test data:
> text
# A tibble: 16 x 4
     sid   tid token     upos
   <int> <int> <chr>     <chr>
 1     1     1 When      ADV
 2     1     2 you       PRON
 3     1     3 ’re       VERB
 4     1     4 traveling VERB
 5     2     1 You       PRON
 6     2     2 also      ADV
 7     2     3 see       VERB
 8     2     4 a         DET
 9     3     1 These     DET
10     3     2 strings   NOUN
11     3     3 of        ADP
12     3     4 beads     NOUN
13     4     1 They      PRON
14     4     2 have      AUX
15     4     3 been      AUX
16     4     4 used      VERB
Suppose we want to find sentences that contain the pattern "ADV VERB" or "ADV PRON VERB". The regex looks like this:
regex = "ADV (PRON )?VERB"
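As a quick sanity check of that regex, here is a small sketch run against a few part-of-speech strings. The `candidates` vector is made up purely for illustration; only the regex itself comes from the answer.

```r
regex <- "ADV (PRON )?VERB"

# Hypothetical part-of-speech strings, made up for illustration
candidates <- c(
  "ADV VERB",          # adverb directly followed by a verb
  "ADV PRON VERB",     # optional pronoun in between
  "PRON ADV VERB DET", # pattern embedded in a longer sentence
  "ADV NOUN VERB",     # a noun in between is not covered by the pattern
  "DET NOUN ADP NOUN"  # no adverb at all
)

# grepl() returns TRUE wherever the pattern occurs anywhere in the string
matches <- grepl(regex, candidates)
print(data.frame(candidates, matches))
```

The first three strings match and the last two do not. Note that grepl() matches anywhere in the string, so the pattern does not have to start the sentence.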
So let's build some "sentences" out of the parts of speech:
library(dplyr)
posSentences =
  text %>%
  arrange(sid, tid) %>%
  group_by(sid) %>%
  summarize(uposSentence = paste(upos, collapse = " "))
The "sentences" look like this:
> posSentences
# A tibble: 4 x 2
    sid uposSentence
  <int> <chr>
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET
3     3 DET NOUN ADP NOUN
4     4 PRON AUX AUX VERB
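You can see that the first two sentences contain the pattern we want and the last two do not. Now just use grepl to find the ones that match the regex: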
theAnswer = filter(posSentences, grepl(regex, posSentences$uposSentence))
And we're done:
> theAnswer
# A tibble: 2 x 2
    sid uposSentence
  <int> <chr>
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET
You can get back to the tokens in those sentences with something like:
filter(text, sid %in% theAnswer$sid)
which in this case yields:
# A tibble: 8 x 4
    sid   tid token     upos
  <int> <int> <chr>     <chr>
1     1     1 When      ADV
2     1     2 you       PRON
3     1     3 ’re       VERB
4     1     4 traveling VERB
5     2     1 You       PRON
6     2     2 also      ADV
7     2     3 see       VERB
8     2     4 a         DET
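The same idea should carry over from `upos` to the `relation` column, which is what the question about an optional determiner needs. Below is a minimal sketch under some assumptions: the `tokens` table and its relation labels are made up (not real spaCy output), and the names `relRegex`, `relSentences`, and `hits` are mine. The `(det )?` group makes the determiner optional, and `\\b` keeps the pattern from matching inside a longer label.

```r
library(dplyr)

# Made-up token table; the relation labels are illustrative, not real spaCy output
tokens <- tibble(
  sid = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  tid = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3),
  token = c("provided", "advantages", "for", "customers",
            "provided", "advantages", "for", "the", "customers",
            "I", "gave", "money"),
  relation = c("root", "dobj", "dative", "pobj",
               "root", "dobj", "dative", "det", "pobj",
               "nsubj", "root", "dobj")
)

# dobj, then dative, then an optional det, then pobj
relRegex <- "\\bdobj dative (det )?pobj\\b"

# One "sentence" of relation labels per sid, in token order
relSentences <- tokens %>%
  arrange(sid, tid) %>%
  group_by(sid) %>%
  summarize(relSentence = paste(relation, collapse = " "))

hits <- filter(relSentences, grepl(relRegex, relSentence))
print(hits)

# Recover the tokens of the matching sentences
print(filter(tokens, sid %in% hits$sid))
```

Sentences 1 and 2 are kept and sentence 3 (dobj only) is dropped. Further optional relations can be handled the same way by adding more optional groups to the regex rather than enumerating every target string.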
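The approach above is much faster and more flexible than the one I provided when @Fatih's question was narrower in scope (looking for one specific three-part pattern). So my previous answer is moot, but I'm leaving it below in case it is useful to anyone.
Original answer (for a specific pattern of 3 values)
Here is a solution using dplyr::group_modify and zoo::rollapply. Basically, by wrapping rollapply inside group_modify, you can use rollapply and paste to combine each triplet of relations within each sentence into a single string. Then simply filter for the desired target string. Before running this code, you may or may not want to remove all punctuation from text, depending on your goals.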
library(zoo)
library(dplyr)
target = "dobj dative pobj"
text %>%
  select(sid, relation) %>%
  # get rid of any sentences with fewer than three words...
  group_by(sid) %>%
  summarize(n = n()) %>%
  filter(n >= 3) %>%
  left_join(text) %>%
  # make sure tokens are in order...
  arrange(sid, tid) %>%
  # now, for each sentence...
  group_by(sid) %>%
  group_modify(
    function(x, y) {
      # paste together each triplet of relations and convert to a data frame.
      rollapply(x[, "relation"], 3, paste, collapse = " ") %>%
        as.data.frame
    }
  ) %>%
  # get all unique combinations of sid and pasted triplets
  distinct %>%
  # select records with the desired pasted triplet
  filter(relation == target) %>%
  # and pull all of the tokens for associated sentences from text
  left_join(text)
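To see what the rollapply step contributes on its own, here is a minimal sketch on the relations of a single sentence. The `relations` vector is made up for illustration, and it is a plain character vector rather than the one-column tibble used inside group_modify above.

```r
library(zoo)

# Relations of one made-up sentence, in token order
relations <- c("nsubj", "root", "dobj", "dative", "pobj")

# Slide a window of width 3 over the vector and paste each window into one string
triplets <- rollapply(relations, 3, paste, collapse = " ")
print(triplets)

# The sentence qualifies if any of its pasted triplets equals the target
target <- "dobj dative pobj"
print(target %in% triplets)
```

A five-token sentence gives three overlapping triplets, so the cost grows with the window count per sentence, which is part of why the regex approach above scales better.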