I have this character vector:
protein = "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
I want to split it into fragments after each occurrence of the letter R.
library(stringr)
peptide_fragments <- str_split(protein, "(?<=[R])")
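Now, from the resulting fragments, I want to omit the substrings that:
- do not contain the letter K
Then, from the remaining substrings, omit those that:
- are shorter than 6 characters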
Using a pure base R regex approach, we can try:
protein <- "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
parts <- strsplit(protein, "(?<=R)", perl=TRUE)[[1]]
output <- grep("^(?=.*K).{6,}$", parts, value=TRUE, perl=TRUE)
output
[1] "TKQTAR" "KSTGGKAPR"
[3] "KQLATKAAR" "KSAPATGGVKKPHR"
[5] "YQKSTELLIR" "KLPFQR"
[7] "EIAQDFKTDLR" "FQSSAVMALQEACEAYLVGLFEDTNLCAIHAKR"
[9] "VTIMPKDIQLAR"
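The pattern passed to `grep` performs two checks at once: the lookahead `(?=.*K)` requires a K somewhere in the fragment, and `.{6,}` requires at least six characters. As a minimal sketch, the same filter can be written as two separate base R conditions, which may be easier to read. The names `has_k`, `long_enough` and `output2` are illustrative and not part of the original code.

```r
protein <- "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"

# Split after every R; the zero-width lookbehind keeps the R at the end of each fragment
parts <- strsplit(protein, "(?<=R)", perl = TRUE)[[1]]

has_k <- grepl("K", parts, fixed = TRUE)  # fragment contains the letter K
long_enough <- nchar(parts) >= 6          # fragment has at least 6 characters

# Keep only fragments that satisfy both conditions
output2 <- parts[has_k & long_enough]

# Compare with the single-regex version
identical(output2, grep("^(?=.*K).{6,}$", parts, value = TRUE, perl = TRUE))
```

Splitting the test into two logical vectors also makes it simple to inspect which fragments fail which condition, for example with `parts[!has_k]`.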
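If you want to split after "R":
library(stringr)  # provides str_split
temp <- unlist(str_split(protein, "(?<=R)"))
# keep fragments that contain K and have at least 6 characters
res <- temp[grepl("K", temp) & nchar(temp) >= 6]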
Result:
res
[1] "TKQTAR" "KSTGGKAPR"
[3] "KQLATKAAR" "KSAPATGGVKKPHR"
[5] "YQKSTELLIR" "KLPFQR"
[7] "EIAQDFKTDLR" "FQSSAVMALQEACEAYLVGLFEDTNLCAIHAKR"
[9] "VTIMPKDIQLAR"
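If the same split-and-filter step is needed for several sequences, it could be wrapped in a small helper. The following is only a sketch in base R under stated assumptions: `filter_fragments` and its parameters `split_after`, `must_contain` and `min_len` are hypothetical names introduced here, and both letter arguments are assumed to be single plain letters with no regex metacharacters.

```r
# Hypothetical helper: split a sequence after a given letter, then keep
# fragments that contain another letter and meet a minimum length.
filter_fragments <- function(seq, split_after = "R", must_contain = "K", min_len = 6) {
  # Build a zero-width lookbehind so the split letter stays at the end of each fragment
  pattern <- sprintf("(?<=%s)", split_after)
  parts <- strsplit(seq, pattern, perl = TRUE)[[1]]
  keep <- grepl(must_contain, parts, fixed = TRUE) & nchar(parts) >= min_len
  parts[keep]
}

protein <- "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
res2 <- filter_fragments(protein)
res2
```

With the default arguments this should reproduce the nine fragments listed above; changing `min_len` or `must_contain` adjusts the filter without touching the split.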