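I have a command that removes some NA values from a data.table. I imported the dataset with the command below, which replaces all empty cells with NA. I noticed that na.strings=c('',' ') actually creates a new level. How can I avoid this? I suspect it has to do with the format of the variable.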
mydata<- setDT(read.csv("~/mydata.csv",na.strings=c('',' ') ))
> str(mydata[,10:11])
Classes ‘data.table’ and 'data.frame': 295114 obs. of 2 variables:
$ location: Factor w/ 22 levels "BALI","BANTEN",..: 4 4 4 4 4 4 6 6 6 6 ...
$ region : Factor w/ 6 levels "Eastern Indonesia",..: 2 2 2 2 2 2 3 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> summary(mydata[,10:11])
location region
DKI JAKARTA :2263 Eastern Indonesia: 14
BANTEN : 356 Jakarta :2263
JAWA BARAT : 150 Java&Bali : 637
JAWA TIMUR : 128 Kalimantan : 15
KALIMANTAN TIMUR: 15 NA : 17
(Other) : 18 Sumatra : 2
NA's : 17
mydata<- setDT(read.csv("~/mydata.csv",na.strings=' '))
> str(clientData[,10:11])
Classes ‘data.table’ and 'data.frame': 295114 obs. of 2 variables:
$ location: Factor w/ 23 levels "","BALI","BANTEN",..: 5 5 5 5 5 5 7 7 7 7 ...
$ region : Factor w/ 6 levels "Eastern Indonesia",..: 2 2 2 2 2 2 3 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> summary(clientData[,10:11])
location region
DKI JAKARTA :22635 Eastern Indonesia: 147
BANTEN : 3568 Jakarta :22635
JAWA BARAT : 1507 Java&Bali : 6379
JAWA TIMUR : 1289 Kalimantan : 155
: 171 NA : 171
KALIMANTAN TIMUR: 154 Sumatra : 22
I tried to remove these NA values manually:
mydata <- mydata[!region == 'NA', ]
> summary(mydata[,11])
countries
Eastern Indonesia: 1472
Jakarta :226357
Java&Bali : 63791
Kalimantan : 1557
NA : 0
Sumatra : 222
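How can I remove the NA level from the data entirely? Could this be written as a function that checks every column of mydata and, for each factor column that has an NA level, drops that level? Something like: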
clean_data <- function(data){
# if columns is factor - drop levels "NA"
# otherwise remove NAs
}
or
mydata <- lapply(mydata, function(x) if(is.factor(x)) droplevels(x) else x)
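If NA is a level in the data frame, we can remove it by converting the values to a factor again, because factor excludes NA by default when it generates the levels.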
df1[] <- lapply(df1, function(x) if(is.factor(x)) factor(x) else x)
Using dplyr:
library(dplyr)
df1 %>% mutate_if(is.factor, factor)
Consider this example:
x <- factor(c(NA, 'a', 'b'), exclude = NULL)
df1 <- data.frame(a = x, b = x)
str(df1)
#'data.frame': 3 obs. of 2 variables:
# $ a: Factor w/ 3 levels "a","b",NA: 3 1 2
# $ b: Factor w/ 3 levels "a","b",NA: 3 1 2
df1[] <- lapply(df1, function(x) if(is.factor(x)) factor(x) else x)
str(df1)
#'data.frame': 3 obs. of 2 variables:
# $ a: Factor w/ 2 levels "a","b": NA 1 2
# $ b: Factor w/ 2 levels "a","b": NA 1 2
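The same idea can be wrapped into the kind of clean_data function asked about in the question. This is only a sketch: it assumes data.table is installed and that the unwanted level is either a true NA level or the literal string "NA". The helper name drop_na_level and the toy table dt are hypothetical.

```r
library(data.table)

# Hypothetical helper: turn the literal level "NA" into a real missing value,
# then rebuild the factor so that unused levels and NA levels are dropped.
drop_na_level <- function(x) {
  x[which(x == "NA")] <- NA
  factor(x)
}

# Apply the helper to every factor column of a data.table, by reference.
clean_data <- function(data) {
  factor_cols <- names(data)[vapply(data, is.factor, logical(1))]
  for (col in factor_cols) {
    set(data, j = col, value = drop_na_level(data[[col]]))
  }
  invisible(data)
}

# Toy data standing in for mydata
dt <- data.table(
  region = factor(c("Jakarta", "NA", "Sumatra")),
  n = 1:3
)
clean_data(dt)
levels(dt$region)  # "Jakarta" "Sumatra"
```

After this call the former "NA" entries are real missing values, so the rows can be dropped with dt[!is.na(region)] if that is what is wanted.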
We can also use an indexing approach:
i1 <- sapply(df1, is.factor)
df1[i1] <- lapply(df1[i1], factor)
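As for why the level appears at all: read.csv treats only the strings listed in na.strings as missing, and the default is "NA". Passing na.strings=c('',' ') replaces that default, so a cell containing the text NA is read as an ordinary level. If that is what happened here (an assumption, since the file itself is not shown), listing "NA" explicitly may avoid the level at import time. The sketch below uses a made-up inline CSV in place of ~/mydata.csv.

```r
# Inline stand-in for the real file; the column names mirror the question.
csv_text <- "location,region
BALI,Java&Bali
,NA
DKI JAKARTA,Jakarta"

# Treat empty cells, single spaces, and the literal text NA as missing.
# stringsAsFactors = TRUE is set explicitly because it is no longer the
# default from R 4.0.0 onwards.
df <- read.csv(text = csv_text,
               na.strings = c("", " ", "NA"),
               stringsAsFactors = TRUE)

levels(df$region)      # "Jakarta" "Java&Bali"
is.na(df$location[2])  # TRUE
```

With the values read as real NA, no extra level is created and the usual tools such as is.na or na.omit apply directly.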