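I am trying to get rid of:
- the whitespace and/or any words after the first word;
- the bracketed text that follows the first word;
- or, put differently, I want to keep only the first word of each column.
This is what my dataset looks like: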
structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin",
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin",
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav",
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride",
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA,
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
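With the code below, I believed I was removing everything after the first whitespace as well as the words in brackets. What I actually want is to keep only the first word in each of the columns new1 and new2.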
dt_test_1 <- dt_test %>%
dplyr::mutate(keep_first_letter_new_1 = gsub(' [A-z ]*', "", new1),
keep_first_letter_new_2 = gsub(' [A-z]*', "", new2),
remove_brackets_new_1 = gsub("( )", "", keep_first_letter_new_1),
remove_brackets_new_2 = gsub("( )", "", keep_first_letter_new_2)
)
However, as dt_test_1 shows, I do not get the output I want. Look at the last two columns, and at the last row of each. This is what I get:
structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin",
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin",
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav",
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride",
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA,
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)"),
keep_first_letter_new_1 = c("Co-amoxiclav", "doxycycline",
"Gentamicin", "Co-trimoxazole", "Sodium", "Piperacillin+(contains)"
), keep_first_letter_new_2 = c(NA, NA, NA, NA, "Vancomycin",
"Tazobactam(contains)"), remove_brackets_new_1 = c("Co-amoxiclav",
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium",
"Piperacillin+(contains)"), remove_brackets_new_2 = c(NA,
NA, NA, NA, "Vancomycin", "Tazobactam(contains)")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
So my question is: why am I not getting the output I want? This is the output I expect:
structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin",
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin",
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav",
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride",
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA,
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)"),
keep_first_letter_new_1 = c("Co-amoxiclav", "doxycycline",
"Gentamicin", "Co-trimoxazole", "Sodium", "Piperacillin+(contains)"
), keep_first_letter_new_2 = c(NA, NA, NA, NA, "Vancomycin",
"Tazobactam(contains)"), remove_brackets_new_1 = c("Co-amoxiclav",
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium",
"Piperacillin"), remove_brackets_new_2 = c(NA, NA, NA, NA,
"Vancomycin", "Tazobactam")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
The last two columns are what I ultimately need.
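In the code below, x is simply the first data structure you provided. I found it easier to split each string on the "+" sign and then extract the first word of each piece. The surrounding whitespace can be trimmed with str_trim from the stringr package.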
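library(stringr)

med <- x$med_name
# Split each name on a literal "+" (escaped, because "+" is a regex quantifier).
strings <- str_split(med, "\\+")
# From each piece, take any leading spaces plus the first run of
# non-space characters, then trim the spaces away.
out <- lapply(strings, function(x) {
  str_trim(str_extract(x, " *(\\S*)"))
})
out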
[[1]]
[1] "Co-amoxiclav"
[[2]]
[1] "doxycycline"
[[3]]
[1] "Gentamicin"
[[4]]
[1] "Co-trimoxazole"
[[5]]
[1] "Sodium" "Vancomycin"
[[6]]
[1] "Piperacillin" "Tazobactam"
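# Collect the first and (if present) second extracted word for each row;
# indexing past the end of a vector yields NA.
new_1 <- character(length(out))
new_2 <- character(length(out))
for (i in seq_along(out)) {
  new_1[i] <- out[[i]][1]
  new_2[i] <- out[[i]][2]
}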
transform(x[-c(2,3)], new_1=new_1, new_2=new_2)
                                                 med_name          new_1      new_2
1                                            Co-amoxiclav   Co-amoxiclav       <NA>
2                                             doxycycline    doxycycline       <NA>
3                                              Gentamicin     Gentamicin       <NA>
4                                          Co-trimoxazole Co-trimoxazole       <NA>
5 Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin         Sodium Vancomycin
6         Piperacillin + Tazobactam (contains penicillin)   Piperacillin Tazobactam
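The same split-then-take-the-first-word idea can also be sketched in base R only, without the explicit loop. This is a minimal sketch, not the answer's own code: the vector med is retyped here from the question's med_name column, and first_word is a hypothetical helper introduced for illustration.

```r
# Assumption: med holds the med_name values from the question.
med <- c("Co-amoxiclav", "doxycycline", "Gentamicin", "Co-trimoxazole",
         "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin",
         "Piperacillin + Tazobactam (contains penicillin)")

# Split on a literal "+"; fixed = TRUE avoids having to escape it.
parts <- strsplit(med, "+", fixed = TRUE)

# Hypothetical helper: trim the piece, then delete everything from the
# first remaining space onward. NA input stays NA.
first_word <- function(s) sub(" .*$", "", trimws(s))

# p[2] is NA when a name contains no "+", so new_2 is NA for those rows.
new_1 <- vapply(parts, function(p) first_word(p[1]), character(1))
new_2 <- vapply(parts, function(p) first_word(p[2]), character(1))

result <- data.frame(med_name = med, new_1 = new_1, new_2 = new_2,
                     stringsAsFactors = FALSE)
result
```

Using vapply with character(1) makes the sketch fail loudly if a piece ever yields something other than a single string, which a plain loop would not catch.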
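While trimming all the unwanted characters, I seem to have got the order of the steps wrong. I believe the mistake was that I removed the whitespace first, so the bracketed word ended up fused to the first word. When I then tried to remove the bracketed word, it failed because there was no longer any whitespace between the words.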
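The fusion described above can be reproduced on a single value from the question. The sketch below is only an illustration: the variable names step1, step2 and cleaned are made up here, and the last call is one possible way to match literal brackets, not necessarily the one the author ended up with.

```r
s <- "Tazobactam (contains penicillin)"

# Step 1 from the question: drop each space plus the letters that follow it.
# The space before "(" is removed, but "(" is not a letter, so "(contains"
# survives and is now glued to the first word.
step1 <- gsub(" [A-z]*", "", s)
step1
# [1] "Tazobactam(contains)"

# Step 2 from the question: "( )" is a capture group holding one space,
# not a literal pair of brackets, and step1 contains no spaces any more.
step2 <- gsub("( )", "", step1)
identical(step1, step2)
# [1] TRUE

# Literal brackets have to be escaped to be matched.
cleaned <- gsub("\\(.*\\)", "", step1)
cleaned
# [1] "Tazobactam"
```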
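So I now remove the unwanted characters and words in a specific order:
- I remove the whitespace, but
- I add a new step that removes the hyphens between words, and then
- I remove whatever second word is left after the first word.
Here is how I solved the problem above: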
library(dplyr)
dt_test_1 <- dt_test %>%
  dplyr::mutate(
    # 1. Drop each space together with the letters that follow it.
    keep_first_letter_new_1 = gsub(" [A-z ]*", "", new1),
    keep_first_letter_new_2 = gsub(" [A-z]*", "", new2),
    # 2. Remove hyphens (this also turns "Co-amoxiclav" into "Coamoxiclav").
    remove_hypen_new_1 = gsub("-", "", keep_first_letter_new_1),
    remove_hypen_new_2 = gsub("-", "", keep_first_letter_new_2),
    # 3. Remove the bracketed word, with an optional "+" and spaces before it.
    remove_any_words_after_first_new1 = gsub("\\+?\\s*\\([^)]+\\)", "", remove_hypen_new_1),
    remove_any_words_after_first_new2 = gsub("\\+?\\s*\\([^)]+\\)", "", remove_hypen_new_2))
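If the goal is only to keep the first word of new1 and new2, the chain of removals could also be collapsed into one anchored substitution that keeps hyphens intact. This is a hedged sketch in base R, not the method used above: keep_first_word and the first_word_* column names are hypothetical, and the small data frame just retypes the two columns from the question.

```r
# Assumption: only the new1 and new2 columns from the question are needed.
dt_test <- data.frame(
  new1 = c("Co-amoxiclav", "doxycycline", "Gentamicin", "Co-trimoxazole",
           "Sodium Chloride",
           "Piperacillin + Tazobactam (contains penicillin)"),
  new2 = c(NA, NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: skip leading whitespace, capture the first run of
# non-space characters, and discard the rest of the string. NA stays NA.
keep_first_word <- function(x) sub("^\\s*(\\S+).*$", "\\1", x)

dt_test$first_word_new1 <- keep_first_word(dt_test$new1)
dt_test$first_word_new2 <- keep_first_word(dt_test$new2)
dt_test
```

Because the whole string is matched and replaced by the captured group, the order of the individual removals no longer matters, and names such as "Co-amoxiclav" keep their hyphen.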