Equivalent of Apache Lucene "proximity searches" in R



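I am working with a corpus of documents (clinical narratives written during hospital stays), mainly using the quanteda package. The goal is to classify documents by the presence or absence of a feature, for example "spastic cough".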

I would like to reproduce the behaviour of Apache Lucene "proximity searches" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches) in R.

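Take this example: "spastic and productive cough in a 91-year-old patient following femoral neck surgery".

I start by tokenizing the phrase as follows: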

library(quanteda)
library(magrittr)  # provides %>%

toks <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
) %>%
  # drop English stop words (NLTK list)
  tokens_remove(pattern = stopwords("en", source = "nltk"))

This produces the following output:

Tokens consisting of 1 document.
text1 :
[1] "spastic"     "productive"  "cough"       "91-year-old" "patient"     "following"   "femoral"    
[8] "neck"        "surgery" 

I can then generate n-grams and skip-grams:

toks <- tokens_ngrams(toks, n = 4, skip = 0:3)
toks
[1] "spastic_productive_cough_91-year-old"     "spastic_productive_cough_patient"        
[3] "spastic_productive_cough_following"       "spastic_productive_cough_femoral"        
[5] "spastic_productive_91-year-old_patient"   "spastic_productive_91-year-old_following"
[7] "spastic_productive_91-year-old_femoral"   "spastic_productive_91-year-old_neck"     
[9] "spastic_productive_patient_following"     "spastic_productive_patient_femoral"      
[11] "spastic_productive_patient_neck"          "spastic_productive_patient_surgery"      
[13] "spastic_productive_following_femoral"     "spastic_productive_following_neck"       
[15] "spastic_productive_following_surgery"     "spastic_cough_91-year-old_patient"       
[17] "spastic_cough_91-year-old_following"      "spastic_cough_91-year-old_femoral"       
[19] "spastic_cough_91-year-old_neck"           "spastic_cough_patient_following"         
[21] "spastic_cough_patient_femoral"            "spastic_cough_patient_neck"              
[23] "spastic_cough_patient_surgery"            "spastic_cough_following_femoral"         
[25] "spastic_cough_following_neck"             "spastic_cough_following_surgery"         
[27] "spastic_cough_femoral_neck"               "spastic_cough_femoral_surgery"           
[29] "spastic_91-year-old_patient_following"    "spastic_91-year-old_patient_femoral"     
[31] "spastic_91-year-old_patient_neck"         "spastic_91-year-old_patient_surgery"     
.........

At this point, I figured I could simply do:

library(stringr)  # provides str_detect()
any(str_detect(as.character(toks), "spastic_cough"))
[1] TRUE

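But I am not sure this is the right approach, because it feels clumsy compared with how Lucene queries work. If I were trying to identify patients with a spastic cough by querying the corpus with Apache Lucene, I would probably use "spastic cough"~3, where "~3" means that any skip-gram with skip 0:3 will match.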
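To make the intended semantics concrete, here is a minimal base R sketch of a two-term proximity test. `near_match` and its arguments are hypothetical names, not part of quanteda or Lucene, and it only approximates "spastic cough"~3: it accepts at most `slop` other tokens between the two terms, in either order, whereas Lucene's slop treats reordered terms differently.

```r
# Hypothetical helper (not from any package): TRUE when `term1` and `term2`
# occur with at most `slop` other tokens between them, in either order.
near_match <- function(tokens, term1, term2, slop = 3) {
  pos1 <- which(tokens == term1)
  pos2 <- which(tokens == term2)
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  # Number of tokens lying between every pair of occurrences
  gaps <- abs(outer(pos1, pos2, "-")) - 1
  any(gaps <= slop)
}

# Tokens of the example sentence after stop word removal
words <- c("spastic", "productive", "cough", "91-year-old", "patient",
           "following", "femoral", "neck", "surgery")

near_match(words, "spastic", "cough", slop = 3)    # one token in between
near_match(words, "spastic", "surgery", slop = 3)  # seven tokens in between
```

Applied to a quanteda tokens object (before n-gram generation), something like `sapply(as.list(toks), near_match, "spastic", "cough")` should give one logical value per document. Because stop words were removed first, distances are counted over the remaining tokens only.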

Any input on how and where I could improve my approach?
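One possibility that stays within quanteda, sketched here under the assumption that a symmetric window is acceptable: keep only the tokens around "spastic" with `tokens_select(..., window = 4)`, then test whether "cough" survived. A window of 4 positions is meant to mirror `skip = 0:3` (up to three tokens in between). The second document and the name `has_feature` are made up for illustration.

```r
library(quanteda)

# Two toy documents; text2 is invented and never mentions "spastic"
txt <- c(
  text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery",
  text2 = "dry cough resolved before femoral neck surgery"
)

toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))

# Keep "spastic" plus up to 4 tokens on either side of each occurrence
near_spastic <- tokens_select(toks, pattern = "spastic", window = 4)

# One logical flag per document: did "cough" fall inside the window?
has_feature <- vapply(as.list(near_spastic),
                      function(x) "cough" %in% x,
                      logical(1))
has_feature
```

The window is applied after stop word removal, so it counts remaining tokens only, and it matches "cough" on either side of "spastic". If order matters, the positions would need to be compared explicitly.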

Edit:

This can be done with: https://search.r-project.org/CRAN/refmans/corpustools/html/search_features.html

However, for now, I do not know how to fit it into my workflow.

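Edit 2:

It seems I can query the corpus with Lucene-like syntax using subset_query. The big problem I now face is that it does not accept a tokens object as input, and the function tokens_to_corpus() does not work for me. This leaves me unable to control the tokenization process.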

Actually, after digging deeper into the documentation, the "corpustools" package provides everything I need for an Apache Lucene experience in R =)
