`na.strings=c('',' ')` in R introduces a new, unwanted factor level NA



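I have a command that removes some NAs from a data.table. I imported the dataset with the command below, so that all empty cells are replaced by NA. I noticed that na.strings=c('',' ') actually creates a new level.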

How can I avoid this? I assume it has something to do with the format of the variables.

mydata<- setDT(read.csv("~/mydata.csv",na.strings=c('',' ') ))
> str(mydata[,10:11])
Classes ‘data.table’ and 'data.frame':  295114 obs. of  2 variables:
$ location: Factor w/ 22 levels "BALI","BANTEN",..: 4 4 4 4 4 4 6 6 6 6 ...
$ region  : Factor w/ 6 levels "Eastern Indonesia",..: 2 2 2 2 2 2 3 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr> 
> summary(mydata[,10:11])
location                    region      
DKI JAKARTA     :2263   Eastern Indonesia:  14  
BANTEN          : 356   Jakarta          :2263  
JAWA BARAT      : 150   Java&Bali        : 637  
JAWA TIMUR      : 128   Kalimantan       :  15  
KALIMANTAN TIMUR:  15   NA               :  17  
(Other)         :  18   Sumatra          :   2  
NA's            :  17  

mydata<- setDT(read.csv("~/mydata.csv",na.strings=' '))
> str(clientData[,10:11])
Classes ‘data.table’ and 'data.frame':  295114 obs. of  2 variables:
$ location: Factor w/ 23 levels "","BALI","BANTEN",..: 5 5 5 5 5 5 7 7 7 7 ...
$ region  : Factor w/ 6 levels "Eastern Indonesia",..: 2 2 2 2 2 2 3 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr> 
> summary(clientData[,10:11])
location                    region      
DKI JAKARTA     :22635   Eastern Indonesia:  147  
BANTEN          : 3568   Jakarta          :22635  
JAWA BARAT      : 1507   Java&Bali        : 6379  
JAWA TIMUR      : 1289   Kalimantan       :  155  
:  171   NA               :  171  
KALIMANTAN TIMUR:  154   Sumatra          :   22  

I tried to remove these NAs manually:

mydata <- mydata[!region == 'NA', ] 
> summary(mydata[,11])
countries      
Eastern Indonesia:  1472  
Jakarta          :226357  
Java&Bali        : 63791  
Kalimantan       :  1557  
NA               :  0  
Sumatra          :   222 

How can I remove the NA level from the data entirely?

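Could this be written as a function that checks every column of mydata and, for each factor column that has a level NA, removes that level? Something like: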

clean_data <- function(data){
# if columns is factor - drop levels "NA"
# otherwise remove NAs 
}

mydata <- lapply(mydata, function(x) if(is.factor(x)) droplevels(x) else x)

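If a data frame has NA as a factor level, we can drop it by converting the values to a factor again, because factor excludes NA by default when it builds the levels. Rebuilding the factor also drops any level that no longer occurs in the data.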

df1[] <- lapply(df1, function(x) if(is.factor(x)) factor(x) else x)

Using dplyr:

library(dplyr)
df1 %>% mutate_if(is.factor, factor)

Consider this example:

x <- factor(c(NA, 'a', 'b'), exclude = NULL)
df1 <- data.frame(a = x, b = x)
str(df1)
#'data.frame':  3 obs. of  2 variables:
# $ a: Factor w/ 3 levels "a","b",NA: 3 1 2
# $ b: Factor w/ 3 levels "a","b",NA: 3 1 2
df1[] <- lapply(df1, function(x) if(is.factor(x)) factor(x) else x)
str(df1)
#'data.frame':  3 obs. of  2 variables:
# $ a: Factor w/ 2 levels "a","b": NA 1 2
# $ b: Factor w/ 2 levels "a","b": NA 1 2
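
The "NA" level in the question may have a second cause worth checking: read.csv uses na.strings = "NA" by default, so passing na.strings=c('',' ') replaces that default, and cells containing the text NA are then read as an ordinary string level. A minimal sketch of this, assuming a small inline CSV (csv_text and the column names are made up for illustration, not taken from the real file):

```r
# Hypothetical inline CSV standing in for the real file; the text NA in the
# last row is meant to mark a missing value.
csv_text <- "id,region\n1,Jakarta\n2,\n3,Sumatra\n4,NA\n"

# Overriding na.strings drops the default "NA", so the text NA is kept as a
# regular factor level.
without_na <- read.csv(text = csv_text, na.strings = "",
                       stringsAsFactors = TRUE)
levels(without_na$region)

# Listing "NA" explicitly makes read.csv treat it as missing again.
with_na <- read.csv(text = csv_text, na.strings = c("", "NA"),
                    stringsAsFactors = TRUE)
levels(with_na$region)
```

If this is what happened, adding "NA" to na.strings at import time avoids the extra level without any later clean-up.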

We can also use an indexing approach, so that only the factor columns are touched:

i1 <- sapply(df1, is.factor)
df1[i1] <- lapply(df1[i1], factor)
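
The same idea can be wrapped in the clean_data function the question asks for. This is only a sketch: it assumes a plain data.frame (for example after as.data.frame()), and it only rebuilds factor columns; it does not remove rows that contain NA.

```r
# Sketch of the helper from the question: rebuild every factor column so that
# unused levels and any NA level are dropped; other columns stay unchanged.
clean_data <- function(data) {
  is_fac <- vapply(data, is.factor, logical(1))
  if (any(is_fac)) {
    data[is_fac] <- lapply(data[is_fac], factor)
  }
  data
}

# Usage on a small made-up data frame with an NA level in column a.
x <- factor(c(NA, "a", "b"), exclude = NULL)
df1 <- data.frame(a = x, n = 1:3)
cleaned <- clean_data(df1)
str(cleaned)
```

The any(is_fac) guard simply skips the assignment when there are no factor columns to rebuild.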
