r - Searching for an ordered sequence of tokens in a sentence represented as a data frame of individual tokens



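I'm trying to learn more about corpus and word analysis in R. Recently I started using cleanNLP with the spaCy backend. After parsing the text, I want to check whether a sentence contains tokens tagged with a particular set of dependency relations.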

For example,

library(cleanNLP)
library(tidyverse)
text <- cnlp_annotate(c("I gave money to him"))

The result would be

  doc_id   sid   tid token token_with_ws lemma  upos  xpos  tid_source relation
   <int> <int> <int> <chr> <chr>         <chr>  <chr> <chr>      <int> <chr>
1      1     1     1 I     "I "          -PRON- PRON  PRP            2 nsubj
2      1     1     2 gave  "gave "       give   VERB  VBD            0 root
3      1     1     3 money "money "      money  NOUN  NN             2 dobj
4      1     1     4 to    "to "         to     ADP   IN             2 dative
5      1     1     5 him   "him"         -PRON- PRON  PRP            4 pobj

I filtered the data frame with

dative <- c("dative")
anno %>%
  filter(grepl(dative, relation)) %>%
  select(sid, sentence)

and looked up the context before and after each match with

anno %>%
  # flag the tokens whose relation matches "dative"
  mutate(kwic = grepl(dative, relation)) %>%
  # collect up to three tokens on either side, dropping the NA padding
  mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
         after = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
  ) %>%
  filter(kwic) %>%
  select(before, token, after)

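I want to extract from the corpus every sentence that contains all three relation labels (dobj, dative, pobj). In other words, I want to check the context before and after the dative token and extract the sentence if that context carries the labels "dobj" and "pobj".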

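So basically, I want to extract sentences with the dobj, dative, pobj pattern (sentences with two objects, such as "I gave money to him"), but not sentences that have only one or two of these relations, say only dobj ("I gave money") or only a preposition plus pobj ("I gave to him").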

How can I do this? Any help is appreciated.

So far, with a great deal of help from @GeoffreyPoole, I have finally got the list. After a few edits to the code below, this is the code and its output:

library(zoo)
library(dplyr)

target <- "root dobj dative pobj"
text %>%
  select(sid, relation, lemma) %>%
  # drop any sentence with fewer than four tokens...
  group_by(sid) %>%
  summarize(n = n()) %>%
  filter(n >= 4) %>%
  left_join(text) %>%
  # make sure tokens are in order...
  arrange(sid, tid, lemma) %>%
  # now, for each sentence...
  group_by(sid) %>%
  group_modify(
    function(x, y) {
      # paste together each run of four relations (and tokens) and convert to a data frame
      rollapply(x[, c("relation", "token")], 4, paste, collapse = " ") %>%
        as.data.frame
    }
  ) %>%
  # get all unique combinations of sid and pasted runs
  distinct %>%
  # select records with the desired pasted run of relations
  filter(relation == target) %>%
  # and pull all of the tokens for associated sentences from text
  left_join(text)
sid relation              token                             doc_id   tid token_with_ws lemma upos  xpos  tid_source
<int> <chr>                 <chr>                              <int> <int> <chr>         <chr> <chr> <chr>      <int>
1   949 root dobj dative pobj gives ideas to people                 NA    NA NA            NA    NA    NA            NA
2  1242 root dobj dative pobj provided advantages for customers     NA    NA NA            NA    NA    NA            NA
3  1631 root dobj dative pobj give harm to themselves               NA    NA NA            NA    NA    NA            NA
4  2275 root dobj dative pobj say this to us                        NA    NA NA            NA    NA    NA            NA
5  3016 root dobj dative pobj write fine to you                     NA    NA NA            NA    NA    NA            NA
6  3826 root dobj dative pobj cause problem for society             NA    NA NA            NA    NA    NA            NA
7  4184 root dobj dative pobj gives harm to women                   NA    NA NA            NA    NA    NA            NA

Only one question remains: do I need to edit target to catch further relations? For example, with the target above, one of the results is

1242 root dobj dative pobj provided advantages for customers

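What happens if the actual sentence has a determiner before the final noun, as in

"provided advantages for [determiner] customers"?

Do I need to rewrite target as "root dobj dative (det) pobj" to catch such patterns?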

Thanks.

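The revised question from @Fatih made me realize that there is a much more powerful (and more efficient) answer to this question than the one I originally posted.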

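The key is to build "sentences" out of the part-of-speech tags rather than out of the tokens (words) themselves, and then use a regex (e.g., with grepl()) to find the "sentences" that have the desired pattern.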

Here is some test data:

> text
# A tibble: 16 x 4
sid   tid token     upos 
<int> <int> <chr>     <chr>
1     1     1 When      ADV  
2     1     2 you       PRON 
3     1     3 ’re       VERB 
4     1     4 traveling VERB 
5     2     1 You       PRON 
6     2     2 also      ADV  
7     2     3 see       VERB 
8     2     4 a         DET  
9     3     1 These     DET  
10     3     2 strings   NOUN 
11     3     3 of        ADP  
12     3     4 beads     NOUN 
13     4     1 They      PRON 
14     4     2 have      AUX  
15     4     3 been      AUX  
16     4     4 used      VERB 

Suppose we want to find sentences with either of the patterns "ADV VERB" or "ADV PRON VERB". The regular expression would look like this:

regex = "ADV (PRON )?VERB"
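
Before running the pattern on real data, it can help to try it on a few hand-written tag strings. The sketch below is illustrative only: the `candidates` vector and the stricter `\\b` word-boundary variant are assumptions added here, not part of the original answer. The boundaries guard against a tag matching as the tail of a longer one (for example, "ADV" inside a made-up "XADV" tag).

```r
# The pattern from the answer: ADV, an optional PRON, then VERB.
regex <- "ADV (PRON )?VERB"

# Hand-written tag strings (hypothetical test inputs).
candidates <- c(
  "ADV PRON VERB VERB",  # contains ADV PRON VERB
  "PRON ADV VERB DET",   # contains ADV VERB
  "DET NOUN ADP NOUN",   # contains neither form
  "ADV DET VERB"         # DET between ADV and VERB is not allowed
)
matches <- grepl(regex, candidates)
print(matches)

# Stricter variant with word boundaries (an assumption, not in the answer).
strict_regex <- "\\bADV (PRON )?VERB\\b"
strict_matches <- grepl(strict_regex, c("XADV VERB", "ADV VERB"), perl = TRUE)
print(strict_matches)
```

Because grepl() looks for the pattern anywhere in the string, the surrounding tags do not need to be spelled out in the regex.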

So let's build some "sentences" out of the part-of-speech tags:

library(dplyr)
posSentences = 
text %>%
arrange(sid, tid) %>%
group_by(sid) %>%
summarize(uposSentence = paste(upos, collapse = " "))

The "sentences" look like this:

> posSentences
# A tibble: 4 x 2
sid uposSentence      
<int> <chr>             
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET 
3     3 DET NOUN ADP NOUN 
4     4 PRON AUX AUX VERB

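You can see that the first two sentences have the pattern we are looking for and the last two do not. Now just use grepl to find the ones that match the regular expression: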

theAnswer = filter(posSentences, grepl(regex, posSentences$uposSentence))

And we're done:

> theAnswer
# A tibble: 2 x 2
sid uposSentence      
<int> <chr>             
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET 

You can get back to the tokens in those sentences with something like:

filter(text, sid %in% theAnswer$sid)

which in this case yields:

# A tibble: 8 x 4
sid   tid token     upos 
<int> <int> <chr>     <chr>
1     1     1 When      ADV  
2     1     2 you       PRON 
3     1     3 ’re       VERB 
4     1     4 traveling VERB 
5     2     1 You       PRON 
6     2     2 also      ADV  
7     2     3 see       VERB 
8     2     4 a         DET  
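
The same idea should carry over to the dependency relations in the question, including the optional determiner. The following is a minimal sketch under stated assumptions: the `relations` tibble is made-up data with only the columns needed here, and `relation_regex`, `relation_sentences` and `matching_sids` are names introduced for illustration.

```r
library(dplyr)

# Hypothetical parse of three sentences (relations only).
relations <- tibble(
  sid = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  tid = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2),
  relation = c("root", "dobj", "dative", "pobj",
               "root", "dobj", "dative", "det", "pobj",
               "root", "dobj")
)

# dobj, dative, an optional det, then pobj.
relation_regex <- "\\bdobj dative (det )?pobj\\b"

# One string of relation labels per sentence, in token order.
relation_sentences <- relations %>%
  arrange(sid, tid) %>%
  group_by(sid) %>%
  summarize(relationSentence = paste(relation, collapse = " "))

# Sentence ids whose relation string contains the pattern.
matching_sids <- relation_sentences %>%
  filter(grepl(relation_regex, relationSentence, perl = TRUE)) %>%
  pull(sid)
print(matching_sids)
```

With this made-up data, sentences 1 and 2 match (without and with the determiner) and sentence 3 does not, so a single regex covers both variants instead of one fixed target string per variant.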

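The approach above is much faster and more flexible than the one I provided when @Fatih's question was narrower (looking for a specific three-part pattern). So my previous answer is now moot, but I'm leaving it below in case it is useful to anyone.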


Original answer (for a specific pattern of 3 values)


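This is a solution using dplyr::group_modify and zoo::rollapply. Basically, by wrapping rollapply in group_modify, you can apply paste over a rolling window within each sentence, so that every triplet of consecutive relations is combined into a single string. Then simply filter for the desired target string. Depending on your goal, you may or may not want to remove all punctuation from text before running this code.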

library(zoo)
library(dplyr)
target = "dobj dative pobj"
text %>%
select(sid, relation) %>%
# get rid of any sentences with less than three words...
group_by(sid) %>%
summarize(n = n()) %>%
filter(n >= 3) %>%
left_join(text) %>%
# make sure tokens are in order...
arrange(sid, tid) %>%
# now, for each sentence...
group_by(sid) %>%
group_modify(
function(x,y) {
#paste together each triplet of relations and convert to a dataframe.
rollapply(x[,"relation"], 3, paste, collapse = " ") %>%
as.data.frame
}
) %>% 
# get all unique combinations of sid and pasted triplets
distinct %>%
# select records with the desired pasted triplet
filter(relation == target) %>%
# and pull all of the tokens for associated sentences from text
left_join(text)
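
To see what the rolling paste step produces for a single sentence, here is a minimal base-R sketch. It is not zoo's implementation: `roll_paste` is a hypothetical helper written for illustration, and the `relation` vector is the example parse from the question.

```r
# Hypothetical helper: paste every run of `width` consecutive values.
roll_paste <- function(values, width) {
  n_windows <- length(values) - width + 1
  if (n_windows < 1) return(character(0))  # sentence shorter than the window
  vapply(
    seq_len(n_windows),
    function(i) paste(values[i:(i + width - 1)], collapse = " "),
    character(1)
  )
}

# Relations of "I gave money to him", in token order.
relation <- c("nsubj", "root", "dobj", "dative", "pobj")

triplets <- roll_paste(relation, 3)
print(triplets)

# The sentence qualifies if any window equals the target string.
has_target <- "dobj dative pobj" %in% triplets
print(has_target)
```

Each window is compared with the target by exact equality, which is why this approach needs a separate target string for every variant of the pattern.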
