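I am working with a corpus of documents (clinical narratives written during hospital stays), mainly using the quanteda package. The goal is to classify documents according to the presence or absence of a feature, for example "spastic cough".
I would like to reproduce the behaviour of Apache Lucene's "proximity search" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches) in R.
Let's take this example: spastic and productive cough in a 91-year-old patient following femoral neck surgery;
I would start by tokenizing this phrase as follows: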
library(quanteda)
library(magrittr)  # provides the %>% pipe

toks =
  tokens(
    c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
    remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
  ) %>%
  # drop English stopwords (NLTK list)
  tokens_remove(pattern = stopwords("en", source = "nltk"))
This produces the following output:
Tokens consisting of 1 document.
text1 :
[1] "spastic" "productive" "cough" "91-year-old" "patient" "following" "femoral"
[8] "neck" "surgery"
I can then go on to generate n-grams and skip-grams:
toks = tokens_ngrams(toks,n=4,skip = 0:3)
toks
[1] "spastic_productive_cough_91-year-old" "spastic_productive_cough_patient"
[3] "spastic_productive_cough_following" "spastic_productive_cough_femoral"
[5] "spastic_productive_91-year-old_patient" "spastic_productive_91-year-old_following"
[7] "spastic_productive_91-year-old_femoral" "spastic_productive_91-year-old_neck"
[9] "spastic_productive_patient_following" "spastic_productive_patient_femoral"
[11] "spastic_productive_patient_neck" "spastic_productive_patient_surgery"
[13] "spastic_productive_following_femoral" "spastic_productive_following_neck"
[15] "spastic_productive_following_surgery" "spastic_cough_91-year-old_patient"
[17] "spastic_cough_91-year-old_following" "spastic_cough_91-year-old_femoral"
[19] "spastic_cough_91-year-old_neck" "spastic_cough_patient_following"
[21] "spastic_cough_patient_femoral" "spastic_cough_patient_neck"
[23] "spastic_cough_patient_surgery" "spastic_cough_following_femoral"
[25] "spastic_cough_following_neck" "spastic_cough_following_surgery"
[27] "spastic_cough_femoral_neck" "spastic_cough_femoral_surgery"
[29] "spastic_91-year-old_patient_following" "spastic_91-year-old_patient_femoral"
[31] "spastic_91-year-old_patient_neck" "spastic_91-year-old_patient_surgery"
.........
At this point, I suppose I could simply do:
library(stringr)  # for str_detect()
any(str_detect(as.character(toks), "spastic_cough"))
[1] TRUE
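But I am not sure I am using the right approach, because it feels clumsy compared with how a Lucene query works. If I were trying to identify patients with a spastic cough by querying the corpus with Apache Lucene, I would probably use "spastic cough"~3, where "~3" means that any skip-gram 0:3 would match.
Any input on how and where I could improve my approach?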
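To make the kind of match I am after concrete, here is a minimal base-R sketch that works on a plain character vector of tokens rather than on pasted skip-grams. The helper `near_ordered()` and its `slop` argument are hypothetical names of my own, not functions from quanteda or corpustools, and it only approximates Lucene: it requires the first term to come before the second and counts the tokens in between, whereas Lucene's slop is an edit distance that also allows reordered terms.

```r
# Hypothetical helper, not part of quanteda or corpustools.
# TRUE when `first` is followed by `second` with at most `slop`
# other tokens in between (slop = 0 means the two are adjacent).
near_ordered <- function(tokens, first, second, slop = 3) {
  tokens <- tolower(tokens)
  i <- which(tokens == tolower(first))
  j <- which(tokens == tolower(second))
  if (length(i) == 0 || length(j) == 0) return(FALSE)
  # number of tokens strictly between every (first, second) pair
  gaps <- outer(j, i, "-") - 1
  any(gaps >= 0 & gaps <= slop)
}

# Tokens of the example sentence after stopword removal
toks_chr <- c("spastic", "productive", "cough", "91-year-old", "patient",
              "following", "femoral", "neck", "surgery")

near_ordered(toks_chr, "spastic", "cough", slop = 3)
near_ordered(toks_chr, "spastic", "surgery", slop = 3)

# Classify several documents by presence or absence of the feature
docs <- list(
  text1 = toks_chr,
  text2 = c("dry", "cough", "fever")  # made-up second document
)
has_feature <- vapply(docs, near_ordered, logical(1),
                      first = "spastic", second = "cough", slop = 3)
has_feature
```

With a quanteda tokens object, `as.list(toks)` should give a named list of character vectors that can be passed in place of `docs`, as long as it is taken before `tokens_ngrams()` is applied.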
Edit:
This can be done with: https://search.r-project.org/CRAN/refmans/corpustools/html/search_features.html
However, for now, I don't know how to include it in my workflow.
Edit 2:
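It seems that I can use subset_query to query the corpus with Lucene-like syntax. The big problem I am facing now is that the tCorpus does not accept a tokens object as input, and the function tokens_to_corpus() does not work for me. That leaves me with no control over the tokenization process.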
Actually, after digging into the documentation, the "corpustools" package offers everything I need to get an Apache Lucene experience in R =)