I am working with a large dataset. My data frame has variables such as the following:
Part<-c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df<-data.frame(Part,Disease_codes)
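I want to classify all disease codes starting with "A" as "blood cancer". Disease codes that start with the letter A (e.g. A100, A145) are blood cancers, and I need to exclude participants with cancer from my study. I cannot do this by hand because I have a very large number of participants. So how can I put the people whose disease code starts with A into a subset and then exclude them from my data frame? For example, I would like output like the following: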
Blood_Cancer_Part<-c(1,2)
Part_without_Blood_cancer<-c(3,4,5,6,7)
In base R, we can use subset:
BloodCancer <- subset(df, grepl('^A', Disease_codes), select = Part)
# OR
# BloodCancer <- subset(df, startsWith(Disease_codes, "A"), select = Part)
BloodCancer
#   Part
# 1    1
# 2    2
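# select = Part keeps only the Part column, matching the output shown below
Part_without_Blood_cancer <- subset(df, !grepl('^A', Disease_codes), select = Part)
# OR
# Part_without_Blood_cancer <- subset(df, !startsWith(Disease_codes, "A"), select = Part)
Part_without_Blood_cancer
#   Part
# 3    3
# 4    4
# 5    5
# 6    6
# 7    7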
Data
Part<-c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df<-data.frame(Part,Disease_codes, stringsAsFactors = FALSE)
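If plain vectors are wanted, as in the desired output in the question, logical indexing is one option. This is a minimal sketch that assumes Disease_codes is a character column; the helper name is_blood_cancer is my own and not from the question.

```r
# Same data as above
Part <- c(1, 2, 3, 4, 5, 6, 7)
Disease_codes <- c("A100", "A145", "B165", "B187", "B102", "C132", "D156")
df <- data.frame(Part, Disease_codes, stringsAsFactors = FALSE)

# TRUE for rows whose disease code starts with "A" (hypothetical helper name)
is_blood_cancer <- startsWith(df$Disease_codes, "A")

# Participants with and without a blood cancer code, as plain vectors
Blood_Cancer_Part <- df$Part[is_blood_cancer]
Part_without_Blood_cancer <- df$Part[!is_blood_cancer]

Blood_Cancer_Part
# [1] 1 2
Part_without_Blood_cancer
# [1] 3 4 5 6 7
```

Note that startsWith is case-sensitive. If some codes might be stored in lower case, wrapping the column in toupper() before the comparison would be one way to handle that.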
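Here is one approach: you can use the stringr package to check the first letter of each code and create new columns from the existing Part column accordingly.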
library(stringr)
library(dplyr)
# Creating the dataframe
Part <- c(1,2,3,4,5,6,7)
Disease_codes <- c("A100","A145","B165","B187","B102","C132","D156")
df <- data.frame(Part, Disease_codes)
df <-
  df %>%
  # If Disease_codes starts with "A", copy the value of Part; otherwise NA.
  # NA_real_ keeps the new column numeric, like Part.
  mutate(Blood_Cancer_Part = ifelse(str_sub(Disease_codes, 1, 1) == "A", Part, NA_real_),
         # If Disease_codes does not start with "A", copy the value of Part;
         # otherwise NA
         Part_without_Blood_cancer = ifelse(str_sub(Disease_codes, 1, 1) != "A", Part,
                                            NA_real_))
# To view as vectors
df$Blood_Cancer_Part[!is.na(df$Blood_Cancer_Part)]
# [1] 1 2
df$Part_without_Blood_cancer[!is.na(df$Part_without_Blood_cancer)]
# [1] 3 4 5 6 7
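If the goal is to actually drop the blood cancer participants from the data frame rather than add columns, filtering the rows may be more direct. The following is only a sketch using dplyr::filter with stringr::str_detect; the object names blood_cancer and without_blood_cancer are my own choices.

```r
library(dplyr)
library(stringr)

Part <- c(1, 2, 3, 4, 5, 6, 7)
Disease_codes <- c("A100", "A145", "B165", "B187", "B102", "C132", "D156")
df <- data.frame(Part, Disease_codes, stringsAsFactors = FALSE)

# Rows whose code starts with "A" (the regex ^A anchors at the start)
blood_cancer <- df %>% filter(str_detect(Disease_codes, "^A"))

# All remaining rows, i.e. the data frame with those participants excluded
without_blood_cancer <- df %>% filter(!str_detect(Disease_codes, "^A"))

# Participant IDs as plain vectors
pull(blood_cancer, Part)
# [1] 1 2
pull(without_blood_cancer, Part)
# [1] 3 4 5 6 7
```

Here without_blood_cancer keeps every column of df, so it can be used directly for the rest of the analysis.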