Why can't I get rid of the words after the first word, and the brackets, in R?



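I am struggling to get rid of:

  1. the space and/or the words that follow the first word;
  2. the brackets (and their contents) that follow the first word.

In other words, I want to keep only the first word of each column.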

This is what my dataset looks like:

structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin", 
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin", 
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride", 
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA, 
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
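With the code below, I believed I was removing the words after the first space and the words in brackets. What I actually want is to keep only the first word in each of the columns new1 and new2.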
dt_test_1 <- dt_test %>%
dplyr::mutate(keep_first_letter_new_1 = gsub(' [A-z ]*', "", new1), 
keep_first_letter_new_2 = gsub(' [A-z]*', "", new2), 
remove_brackets_new_1 = gsub("( )", "", keep_first_letter_new_1), 
remove_brackets_new_2 = gsub("( )", "", keep_first_letter_new_2)
)

However, as can be seen in dt_test_1, I did not get the output I wanted. Check the last two columns, and the last row of each. This is what I got:

structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin", 
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin", 
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride", 
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA, 
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)"), 
keep_first_letter_new_1 = c("Co-amoxiclav", "doxycycline", 
"Gentamicin", "Co-trimoxazole", "Sodium", "Piperacillin+(contains)"
), keep_first_letter_new_2 = c(NA, NA, NA, NA, "Vancomycin", 
"Tazobactam(contains)"), remove_brackets_new_1 = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium", 
"Piperacillin+(contains)"), remove_brackets_new_2 = c(NA, 
NA, NA, NA, "Vancomycin", "Tazobactam(contains)")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

So my question is: why am I not getting the output I want? This is the output I expect:

structure(list(med_name = c("Co-amoxiclav", "doxycycline", "Gentamicin", 
"Co-trimoxazole", "Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin", 
"Piperacillin + Tazobactam (contains penicillin)"), new1 = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium Chloride", 
"Piperacillin + Tazobactam (contains penicillin)"), new2 = c(NA, 
NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)"), 
keep_first_letter_new_1 = c("Co-amoxiclav", "doxycycline", 
"Gentamicin", "Co-trimoxazole", "Sodium", "Piperacillin+(contains)"
), keep_first_letter_new_2 = c(NA, NA, NA, NA, "Vancomycin", 
"Tazobactam(contains)"), remove_brackets_new_1 = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Sodium", 
"Piperacillin"), remove_brackets_new_2 = c(NA, NA, NA, NA, 
"Vancomycin", "Tazobactam")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

The last two columns are what I ultimately need.

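In the code below, x is simply the first data structure you provided. I found it easier to split the strings on the "+" sign and then extract the first word of each piece. The surrounding whitespace can be trimmed with str_trim from the stringr package.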

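library(stringr)
med <- x$med_name
# split on the literal "+" (escaped, because "+" is a regex quantifier)
strings <- str_split(med, "\\+")
# for each piece, take the first run of non-space characters and trim whitespace
out <- lapply(strings, function(x) {
  str_trim(str_extract(x, " *(\\S*)"))
})
out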
[[1]]
[1] "Co-amoxiclav"
[[2]]
[1] "doxycycline"
[[3]]
[1] "Gentamicin"
[[4]]
[1] "Co-trimoxazole"
[[5]]
[1] "Sodium"     "Vancomycin"
[[6]]
[1] "Piperacillin" "Tazobactam"
new_1=c()
new_2=c()
for (i in 1:6) {
new_1[i]=out[[i]][1]
new_2[i]=out[[i]][2]
}        
transform(x[-c(2,3)], new_1=new_1, new_2=new_2)
                                                 med_name          new_1      new_2
1                                            Co-amoxiclav   Co-amoxiclav       <NA>
2                                             doxycycline    doxycycline       <NA>
3                                              Gentamicin     Gentamicin       <NA>
4                                          Co-trimoxazole Co-trimoxazole       <NA>
5 Sodium Chloride 0.9% infusion (ANES) 20 mL + Vancomycin         Sodium Vancomycin
6         Piperacillin + Tazobactam (contains penicillin)   Piperacillin Tazobactam
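
If the goal is only the first word of the existing new1 and new2 columns, a single anchored substitution may be enough. The following is a minimal sketch rather than part of the answer above: first_word is a hypothetical helper name, and the two vectors are copied from the question's data.

```r
# Vectors copied from the question's new1 and new2 columns
new1 <- c("Co-amoxiclav", "doxycycline", "Gentamicin", "Co-trimoxazole",
          "Sodium Chloride", "Piperacillin + Tazobactam (contains penicillin)")
new2 <- c(NA, NA, NA, NA, "Vancomycin", "Tazobactam (contains penicillin)")

# first_word() is a hypothetical helper: it captures the first run of
# non-whitespace characters and discards everything after it
first_word <- function(s) sub("^\\s*(\\S+).*$", "\\1", s, perl = TRUE)

first_1 <- first_word(new1)
first_2 <- first_word(new2)  # NA inputs stay NA
print(first_1)
print(first_2)
```

Because the pattern is anchored at the start of the string, the "+" sign, the brackets, and any later words are all dropped in one step, so the order of operations no longer matters.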

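While trying to trim all the unwanted characters, I seem to have got the order of operations wrong. I think the error was that I removed the spaces first, so the bracketed word was fused with the first word into a single word. Then, when I tried to remove the bracketed word, it failed because there was no whitespace left between the words.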
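
The effect can be reproduced on a single string. This is a small sketch, not the original code, and the variable names are made up for illustration. It also suggests a second cause: the pattern "( )" is a regex group containing one space rather than a literal pair of brackets, so it has nothing left to match once the spaces are gone. Literal brackets need to be escaped.

```r
s <- "Piperacillin + Tazobactam (contains penicillin)"

# Step 1 from the question: drop each space together with the
# letters and spaces that follow it
step1 <- gsub(" [A-z ]*", "", s)

# Step 2 from the question: "( )" is a capture group around a single space,
# so with no spaces left in step1 nothing is replaced
step2 <- gsub("( )", "", step1)

# Escaped brackets match the literal "(...)" text instead
no_brackets <- gsub("\\s*\\([^)]*\\)", "", s)

print(step1)
print(step2)
print(no_brackets)
```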

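So I now keep a specific order when removing the unwanted characters and words:

  1. I remove the whitespace, but
  2. I add a new step that removes the hyphens between words, and then
  3. I remove whatever follows the first word.

Here is how I solved the problem above: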

dt_test_1 <- dt_test %>%
  dplyr::mutate(
    # drop each space together with the letters that follow it
    keep_first_letter_new_1 = gsub(' [A-z ]*', "", new1),
    keep_first_letter_new_2 = gsub(' [A-z]*', "", new2),
    # remove hyphens (note: this also changes names such as "Co-amoxiclav")
    remove_hypen_new_1 = gsub("-", "", keep_first_letter_new_1),
    remove_hypen_new_2 = gsub("-", "", keep_first_letter_new_2),
    # remove any bracketed text, with optional leading whitespace
    remove_any_words_after_first_new1 = gsub("\\s*\\([^\\)]+\\)", "", remove_hypen_new_1),
    remove_any_words_after_first_new2 = gsub("\\s*\\([^\\)]+\\)", "", remove_hypen_new_2))
