r - How to extract keywords located below one line and above another in an article's text



I have a character vector of lines from a journal article:

test_1 <- c("                                                                  Journal of Neonatal Nursing 27 (2021) 106–110", 
"                                                                     Contents lists available at ScienceDirect", 
"                                                               Journal of Neonatal Nursing", 
"                                                              journal homepage: www.elsevier.com/locate/jnn", 
"Comparison of inter-facility transports of critically ill neonates who died", 
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b", 
"a", "  Children’s Hospital of Michigan, Detroit, MI, USA", 
"b", "  Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA", 
"A R T I C L E I N F O                                       A B S T R A C T", 
"Keywords:                                                   Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the", 
"Inter-facility transport                                    Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants", 
"Neonatal intensive care                                     who died within 7 days of admission to a level IV NICU versus matched survivors.", 
"Mortality", "                                                            Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls", 
"                                                            matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de­", 
"                                                            rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product", 
"                                                            transfusion.", 
"                                                            Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar", 
"                                                            scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was", 
"                                                            independently associated with male gender and unplanned events; not with patient group.", 
"                                                            Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a", 
"                                                            transport team are comparable in cases and controls.", 
"                                                                                              outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is", 
"1. Introduction                                                                               an assessment measure of infant status before and after transport (Lee"
)

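I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, and Mortality. I tried to get the "Keywords" line using test_1[str_detect(test_1, "^Keywords:")], and I want to get all the keywords below this line and above 1. Introduction.

What regex or stringr function would do this?

Thanks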

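If I understand correctly, you are scanning a PDF downloaded from here. I think you should find a better way to scan your PDFs.

Until then, the best option may be: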

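library(stringr)
# index of the line after the one starting with "Keywords:"
start <- which(str_detect(test_1, "^Keywords:")) + 1
# index of the line before the one starting with "1. Introduction"
# (the dot is escaped so it matches a literal ".")
end <- which(str_detect(test_1, "^1\\. Introduction")) - 1
# the lines in between
x <- test_1[start:end]
# the keywords sit in the left column, i.e. the first 60 characters of each line
x <- str_trim(str_sub(x, 1, 60))
# drop lines whose left column is empty
x <- x[x != ""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care"  "Mortality"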
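
The same steps can be wrapped in a small function that returns an empty vector when either marker line is missing. This is only a sketch in base R (so it runs without stringr); the name `extract_keywords`, its arguments, and the abbreviated `sample_lines` vector are hypothetical stand-ins for the question's `test_1`, and the column width of 60 is assumed from the layout shown above.

```r
# Abbreviated stand-in for test_1: a left column padded to 60 characters,
# followed by (shortened) right-column text.
left <- c("A R T I C L E I N F O", "Keywords:", "Inter-facility transport",
          "Neonatal intensive care", "Mortality", "", "1. Introduction")
right <- c("A B S T R A C T", "Objective: To compare characteristics",
           "Transport Risk Index of Physiologic Stability",
           "who died within 7 days of admission", "",
           "Study design: This retrospective case-control study",
           "an assessment measure of infant status")
sample_lines <- sprintf("%-60s%s", left, right)

# Hypothetical helper: keywords between a start marker and an end marker,
# read from the first `width` characters of each line.
extract_keywords <- function(lines,
                             start_pattern = "^Keywords:",
                             end_pattern = "^1\\. Introduction",
                             width = 60) {
  start <- grep(start_pattern, lines)[1]
  if (is.na(start)) return(character(0))      # no "Keywords:" line
  after <- grep(end_pattern, lines)
  end <- after[after > start][1]              # first end marker after the start
  if (is.na(end)) end <- length(lines) + 1    # no end marker: read to the end
  if (end - start < 2) return(character(0))   # nothing between the markers
  block <- lines[(start + 1):(end - 1)]
  keywords <- trimws(substr(block, 1, width)) # keep the left column only
  keywords[keywords != ""]
}

result <- extract_keywords(sample_lines)
print(result)
```

If the PDF extraction uses a different column width, only the `width` argument should need to change.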

Edit

You can define a function that finds the index of the line where Keywords occurs and returns the indices of the three lines below it:

find_keywords <- function(pattern, text) {
  index <- which(grepl(pattern, text))
  sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}

Based on that function, you can extract the keywords:

library(stringr)
# "^\\S+" matches the first run of non-whitespace characters on each line
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
#> [1] "Inter-facility" "Neonatal"       "Mortality"
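
Note that `^\\S+` stops at the first space, so only the first word of each keyword comes back. A possible variant, sketched here in base R, keeps the whole phrase by cutting each line at the first run of two or more whitespace characters, and keeps reading lines until the left column is empty instead of assuming exactly three keywords. The name `keywords_left_column` and the abbreviated `sample_lines` vector are hypothetical; the sketch assumes that keywords never contain two consecutive spaces and that a line with an empty left column follows the last keyword, as in `test_1`.

```r
# Abbreviated stand-in for test_1: a left column padded to 60 characters,
# followed by (shortened) right-column text.
left <- c("Keywords:", "Inter-facility transport", "Neonatal intensive care",
          "Mortality", "", "1. Introduction")
right <- c("Objective: To compare characteristics",
           "Transport Risk Index of Physiologic Stability",
           "who died within 7 days of admission", "",
           "Study design: This retrospective case-control study",
           "an assessment measure of infant status")
sample_lines <- sprintf("%-60s%s", left, right)

# Hypothetical helper: full keyword phrases from the lines below "Keywords:".
keywords_left_column <- function(lines, pattern = "^Keywords:") {
  start <- grep(pattern, lines)[1]
  if (is.na(start) || start == length(lines)) return(character(0))
  rest <- lines[(start + 1):length(lines)]
  # Stop at the first line that does not begin with a non-space character,
  # i.e. the first line whose left column is empty.
  stop_at <- which(!grepl("^[^[:space:]]", rest))[1]
  if (!is.na(stop_at)) rest <- rest[seq_len(stop_at - 1)]
  # Drop everything from the first run of two or more whitespace characters.
  trimws(sub("[[:space:]]{2,}.*$", "", rest))
}

phrases <- keywords_left_column(sample_lines)
print(phrases)
```

Because this relies on the gap between the two columns rather than on a fixed width, it may cope better with articles whose keyword column is narrower or wider than 60 characters.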
