r-将Summary()输出转换为data.table



让我们调用summary(my_data):

year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
NA's   :49                        NA's   :49       NA's   :49     

我想把这个输出转换成如下格式的data.table,其中所有条目(在本描述中省略(都是min,1st quartile的原始值。等等:

year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
Min.      
1st Qu.   
Median    
Mean     
3rd Qu.   
Max.            
NA's          

以下几乎达到了这个结果,除了Min、1st Qu.、Middle、Mean、3rd Qu.、Max和NA结转到每个条目之外。我想要纯粹的原始数字。

data.frame(unclass(summary(my_data)), check.names = FALSE, stringsAsFactors = FALSE)
year         quarter         employed         newhires      separations      jobscreated   jobsdestroyed
X   Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
X.1 1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
X.2 Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
X.3 Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
X.4 3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
X.5 Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
X.6           <NA>            <NA>             <NA>     NA's   :49               <NA>     NA's   :49      NA's   :49  

潜在的解决方案包括(1(直接从summary((派生表,或(2(使用上面的输出,并找到一种方法从reach条目中删除Min.、1st Q.、Middle、Mean、3rd Q.、Max和NA标签,并将它们列为列名。非常感谢你的帮助!

重复链接代码的自适应,因为它在不完整的(NA/NaN(数据(例如,newhires(中不能干净地工作:

mtcars[2,2] <- NA
mtcars[3,2] <- NaN
as.data.frame(sapply(mtcars, summary))
# Error in dimnames(x) <- dnx : 'dimnames' applied to non-array

这是因为summary硬编码(bleh(table(..., useNA="ifany"),这意味着一些列可能返回长度6,一些长度7,这挫败了大多数符合类data.frame结构的随意尝试。

解决这一问题的一种方法(除了从头开始重写summary以修复该错误(是将一个NA添加到所有向量并从结果中减去;这迫使所有摘要都包含该字段,一旦减去该字段,它就应该表示数据,并且对于as.data.frame:来说足够矩形

fixed_summary <- function(object, ...) {
o <- summary(c(object, NA), ...)
o["NA's"] <- o["NA's"] - 1L
o
}
ret <- as.data.frame(sapply(mtcars, fixed_summary))
ret
#              mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# Min.    10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
# 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
# Median  19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
# Mean    20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
# 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
# Max.    33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
# NA's     0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000

而且,根据您的担忧,"Min."(etc(标签不会为每一列保留:它们只是行名。

许多R工具不能保证保留行名;事实上,有些人会想尽办法把它们擦干净。我的偏好是不依赖行名,而是将它们作为显式列引入。这主要是主观的、部分防御性的编程,当然不是要求。

ret$rownames <- rownames(ret)
rownames(ret) <- NULL
ret
#        mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb rownames
# 1 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000     Min.
# 2 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000  1st Qu.
# 3 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000   Median
# 4 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125     Mean
# 5 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000  3rd Qu.
# 6 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000     Max.
# 7  0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000     NA's

(柱的顺序是完全可延展的。(

# adjust summary(.) 
# returns summary of numeric (including factor) columns of a data frame 
# stats_along='row', put summary stats on the rows and variables along the columns
my_summ <- function(df, stats_along='row') {
df_nonchar = df[, !sapply(df, typeof) %in% "character"]
summ = data.frame(summary(df_nonchar), row.names = NULL)
# test for empty columns:
# # usually the 1st column is empty as a result of coercing an obj of
# class(summary obj) "table" to data.frame.
empty = sapply(summ, function(x) all(x == ""))
summ = summ[, !empty]
summ = setNames(summ, c("var_name", "stats"))
summ = summ[which(!is.na(summ$stats)), ]
# just in case if there are multiple :'s, we need to split only at the first match
summ$stats = sub(":", "-;-", summ$stats)
summ = data.frame(summ[1], do.call(rbind, strsplit(summ$stats, "-;-")))
names(summ)[-1] = c("stats", "value")
summ$var_name = trimws(summ$var_name) # rm white spaces
# pivot into wide form, using 'stats' column as a key.
stats_along = match.arg(stats_along, c('row', 'col'))
if (stats_along == 'row') {
idvar = "stats"
timevar = "var_name"
} else if (stats_along == 'col') {
idvar = "var_name"
timevar = "stats"
}
summ = reshape(
summ,
direction = "wide",
idvar = idvar,
timevar = timevar,
v.names = "value",
sep = "_"
)
var_nms = sub("(value_)(.+)", "\2", names(summ)[-1])
names(summ)[-1] = var_nms
rownames(summ) = NULL

# remove white spaces from cells
summ[] = lapply(summ, function(x) gsub("\s+$", "", x))
# when vars in the dataset contain NAs, we may have two additional columns in
# summary call
nas = "NA's" %in% colnames(summ)
if (any(nas)) {
names(summ)[names(summ) == "NA's"] = "missing"
}
summ
}
my_summ(mtcars)
stats   mpg   cyl  disp    hp  drat    wt  qsec     vs     am  gear  carb
1    Min. 10.40 4.000  71.1  52.0 2.760 1.513 14.50 0.0000 0.0000 3.000 1.000
2 1st Qu. 15.43 4.000 120.8  96.5 3.080 2.581 16.89 0.0000 0.0000 3.000 2.000
3  Median 19.20 6.000 196.3 123.0 3.695 3.325 17.71 0.0000 0.0000 4.000 2.000
4    Mean 20.09 6.188 230.7 146.7 3.597 3.217 17.85 0.4375 0.4062 3.688 2.812
5 3rd Qu. 22.80 8.000 326.0 180.0 3.920 3.610 18.90 1.0000 1.0000 4.000 4.000
6    Max. 33.90 8.000 472.0 335.0 4.930 5.424 22.90 1.0000 1.0000 5.000 8.000

如果优选沿着rowvars和沿着column轴的stats,则

my_summ(mtcars, 'col')
var_name Min.    1st Qu. Median  Mean    3rd Qu. Max.   
1       mpg   10.40   15.43   19.20   20.09   22.80   33.90
2       cyl   4.000   4.000   6.000   6.188   8.000   8.000
3      disp    71.1   120.8   196.3   230.7   326.0   472.0
4        hp    52.0    96.5   123.0   146.7   180.0   335.0
5      drat   2.760   3.080   3.695   3.597   3.920   4.930
6        wt   1.513   2.581   3.325   3.217   3.610   5.424
7      qsec   14.50   16.89   17.71   17.85   18.90   22.90
8        vs  0.0000  0.0000  0.0000  0.4375  1.0000  1.0000
9        am  0.0000  0.0000  0.0000  0.4062  1.0000  1.0000
10     gear   3.000   3.000   4.000   3.688   4.000   5.000
11     carb   1.000   2.000   2.000   2.812   4.000   8.000

  • 注意:如果您需要一个数据表,请使用my_summ(.) |> as.data.table()

另一个选项是创建自己的摘要函数,如果需要,可以添加第六个元素:

ownSummary = function(x) {
x = summary(x)
if(length(x)==6) x[7] = 0
x
} 

然后你就可以运行这个了。

data.table(sapply(mtcars, ownSummary), keep.rownames = T)
rn      mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
1:    Min. 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
2: 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
3:  Median 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
4:    Mean 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
5: 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
6:    Max. 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
7:          0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000

最新更新