Suppose I have one long character string that contains city names and state names.
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
My goal is to extract all of the city names from it. With some help, I got there by applying:
pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))
The problem is that I now want to do the same for the states, but I can't fully follow the logic of the code above. Any help?
You can use str_extract_all from stringr:
library(stringr)
str_extract_all(test, "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))")
Result:
[[1]]
[1] "California" "Connecticut" "Massachusetts" "Massachusetts" "Missouri" "New York"
[7] "New York" "North Carolina" "Ohio" "Tennessee" "Washington" "Korea"
[13] "Korea" "Korea" "Korea" "Korea"
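Notes:
[\w\s]+ matches any word character or whitespace character, one or more times.
(?<=,\s) is a positive lookbehind that requires a comma and a whitespace character immediately before the match.
(?=,[\w\s]+(\||$)) is a positive lookahead that requires the match to be followed by a comma, then one or more word or whitespace characters, then either | or the end of the string.
Put together, the pattern matches a run of word or whitespace characters only when it is preceded by a comma and a space and followed by a comma, one or more word or whitespace characters, and then | or the end of the string. In effect, it matches the second-to-last comma-separated element of each location.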
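To see the lookbehind and lookahead at work on something easier to read, here is a minimal sketch on a shortened input. The names sample_text and state_pat are made up for illustration, and the sample simply reuses two locations from the string in the question. Note that the backslashes are doubled because the pattern sits inside an R string literal.

```r
library(stringr)

# Hypothetical two-location sample in the same "name, city, state, country|..." layout
sample_text <- "Yale Cancer Center, New Haven, Connecticut, USA|Asan Medical Center., Seoul, Korea, Republic of"

# Same pattern as in the answer; each regex backslash is written as \\ in R
state_pat <- "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))"

# str_extract_all returns a list with one element per input string
states <- str_extract_all(sample_text, state_pat)[[1]]
print(states)
```

"New Haven" and "Seoul" are skipped because what follows them contains a second comma before the next | or the end of the string, so the lookahead fails. Only "Connecticut" and "Korea" are returned.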
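Another approach is to nest str_split calls: split on | first, then use sapply to apply str_split to each element, splitting a second time on ", ". This avoids lookarounds, though str_split itself still comes from stringr (base R's strsplit can be substituted to drop the package dependency). It assumes the state is always the third element of each location: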
unname(sapply(unlist(str_split(test, "\\|")),
              function(x) unlist(str_split(x, ", "))[3]))
Result:
[1] "California" "Connecticut" "Massachusetts" "Massachusetts" "Missouri" "New York"
[7] "New York" "North Carolina" "Ohio" "Tennessee" "Washington" "Korea"
[13] "Korea" "Seoul" "Korea" "Korea" NA
Note that the last element is NA because that location has no third element.
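If the fixed third position is too brittle (the Severance Hospital entry has an extra field, so position 3 lands on "Seoul"), one possible variation is to count from the end instead. The sketch below uses base R only; second_to_last and sample_text are hypothetical names introduced for illustration, and the three-field minimum is an assumption chosen to mirror what the regex version requires.

```r
# Small hypothetical sample: a four-field entry, a five-field entry, and one with no commas
sample_text <- paste(
  "Yale Cancer Center, New Haven, Connecticut, USA",
  "Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of",
  "VU MEDISCH CENTRUM; Dept. of Medical Oncology",
  sep = "|"
)

# fixed = TRUE treats "|" as a literal character, so no escaping is needed
locations <- strsplit(sample_text, "|", fixed = TRUE)[[1]]

# Hypothetical helper: second-to-last comma-separated field,
# or NA when the location has fewer than three fields
second_to_last <- function(location) {
  parts <- strsplit(location, ", ", fixed = TRUE)[[1]]
  if (length(parts) < 3) NA_character_ else parts[length(parts) - 1]
}

# vapply enforces one character value per location; unname drops the input names
states <- unname(vapply(locations, second_to_last, character(1)))
print(states)
```

On this sample the five-field entry yields "Korea" rather than "Seoul", and the entry without commas still yields NA.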