Can't understand the logic of regular expression functions in R for word matching



Suppose I have one long character string that contains city names and state names.

test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

My goal is to extract all of the city names from it. With some help, I managed to do that with:

pat <- "(,.\\w+,)|(,.\\w+.\\w+,)"   # backslashes are doubled inside R string literals
m <- strsplit(test, "\\|")[[1]]      # one element per location
gsub("(,\\s)|,", "", regmatches(m, regexpr(pat, m)))

The problem is that I now want to do the same thing for the states, but I can't fully understand the logic of the code above. Any help?

You can use str_extract_all from stringr:

library(stringr)
# Backslashes are doubled because the pattern is written as an R string literal
str_extract_all(test, "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))")

Result:

[[1]]
[1] "California"     "Connecticut"    "Massachusetts"  "Massachusetts"  "Missouri"       "New York"      
[7] "New York"       "North Carolina" "Ohio"           "Tennessee"      "Washington"     "Korea"         
[13] "Korea"          "Korea"          "Korea"          "Korea"  

Notes:

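  1. [\w\s]+ matches any word character or whitespace one or more times.

  2. (?<=,\s) is a positive lookbehind that requires a comma and a whitespace character immediately before the match.

  3. (?=,[\w\s]+(\||$)) is a positive lookahead that requires a comma, then one or more word characters or whitespace, then either | or the end of the string immediately after the match.

  4. The whole pattern therefore matches one or more word characters or whitespace only when they follow a comma and a space and are followed by a comma, one or more word characters or whitespace, and then | or the end of the string. In essence, this matches the second-to-last comma-separated element of each location.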
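
To see what each lookaround contributes, here is a minimal sketch on a single location taken from test. It assumes base R's regmatches and gregexpr with perl = TRUE in place of stringr, so it runs without any package; the variable names are only illustrative.

```r
# One location from `test`, used on its own for illustration.
loc <- "Yale Cancer Center, New Haven, Connecticut, USA"

# Lookbehind only: every run of word characters/whitespace that follows ", ".
after_comma <- regmatches(loc, gregexpr("(?<=,\\s)[\\w\\s]+", loc, perl = TRUE))[[1]]
print(after_comma)

# Adding the lookahead keeps only the run that is followed by exactly one
# more comma-separated element before "|" or the end of the string.
pattern <- "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))"
state <- regmatches(loc, gregexpr(pattern, loc, perl = TRUE))[[1]]
print(state)
```

The first call returns "New Haven", "Connecticut" and "USA"; the second keeps only "Connecticut", because it is the only candidate with exactly one element left after it.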

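Another way is to nest two str_split calls: split on | first, then use sapply to apply str_split to each element, splitting on ", " the second time. Note that str_split comes from stringr; base R's strsplit can do the same job without a package. This approach assumes that the state is always the third element of each location: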

# "\\|" escapes the pipe, which is otherwise the regex alternation operator
unname(sapply(unlist(str_split(test, "\\|")),
              function(x) unlist(str_split(x, ", "))[3]))

Result:

[1] "California"     "Connecticut"    "Massachusetts"  "Massachusetts"  "Missouri"       "New York"      
[7] "New York"       "North Carolina" "Ohio"           "Tennessee"      "Washington"     "Korea"         
[13] "Korea"          "Seoul"          "Korea"          "Korea"          NA 

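Note that the last element is NA because that location has no third element. Also note that element 14 is "Seoul" rather than "Korea": that location has an extra comma in the hospital name, so the third element is the city.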
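
If the fixed position is a concern, one possible variation is to count from the end instead. The sketch below uses base R's strsplit and assumes that the state is always the second-to-last comma-separated element and that a usable location has at least three elements; second_to_last is a hypothetical helper name, not part of any package.

```r
# Three locations from `test`: one with an extra comma, one regular, one without commas.
locs <- c(
  "Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of",
  "Yale Cancer Center, New Haven, Connecticut, USA",
  "VU MEDISCH CENTRUM; Dept. of Medical Oncology"
)

# Hypothetical helper: return the second-to-last element of a ", "-separated
# string, or NA when there are fewer than three elements.
second_to_last <- function(x) {
  parts <- strsplit(x, ", ", fixed = TRUE)[[1]]
  if (length(parts) < 3) NA_character_ else parts[length(parts) - 1]
}

states <- vapply(locs, second_to_last, character(1), USE.NAMES = FALSE)
print(states)
```

On these three inputs this gives "Korea", "Connecticut" and NA, which agrees with the lookaround pattern for the first two and still flags the malformed last entry.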
