R - custom function to summarise a data frame based on variable class

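I am trying to write a custom function in R that takes the mean and standard deviation if a variable is numeric, or counts the occurrences of each level if the variable is categorical. As an added twist, I would like the function to count only "Yes" if the variable is categorical and its levels include "Yes". Ideally, this would be an input argument: when it is supplied, the function counts only the number of "Yes" values.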

To demonstrate, suppose we have the following data:

Height <- round(rnorm(10, 175, 10), 0)
weight <- round(rnorm(10, 70, 10), 0)
smoke <- c(rep("Yes", 3), rep("No", 4), rep("Unknown", 3))
problem <- c(rep("depression", 3), rep("insomnia", 2), rep("IBS", 5))
data <- as.data.frame(cbind(Height, weight, smoke, problem))
data$Height <- as.numeric(data$Height)
data$weight <- as.numeric(data$weight)
data$smoke <- factor(data$smoke, levels = c("Yes", "No", "Unknown"))
data$problem <- factor(data$problem, levels = c("depression", "insomnia", "IBS"))
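
As an aside, cbind() first builds a character matrix, which is why the numeric columns above have to be converted back with as.numeric(). A minimal sketch of a more direct construction follows; the object name demo and the seed value are assumptions made only for illustration and reproducibility, not part of the original data.

```r
set.seed(42)  # hypothetical seed, only so the example is reproducible

# data.frame() keeps each column's own type, so no conversion is needed
demo <- data.frame(
  Height  = round(rnorm(10, 175, 10), 0),
  weight  = round(rnorm(10, 70, 10), 0),
  smoke   = factor(c(rep("Yes", 3), rep("No", 4), rep("Unknown", 3)),
                   levels = c("Yes", "No", "Unknown")),
  problem = factor(c(rep("depression", 3), rep("insomnia", 2), rep("IBS", 5)),
                   levels = c("depression", "insomnia", "IBS"))
)

# Two numeric columns followed by two factors
col_classes <- sapply(demo, class)
col_classes
```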

My function is:

sumfun <- function(x){
  if(is.numeric(x)){
    # numeric variable: "mean (sd) / n"
    m = round(mean(x, na.rm = TRUE), digits = 2)
    s = round(sd(x, na.rm = TRUE), digits = 2)
    return(list(cbind("", paste0(m, " ", "(", s, ")", " ", "/", " ", sum(!is.na(x))))))
  } else {
    # is.factor(x) | is.character(x): "n (%) / n" for each level
    n = table(x)
    pro = round(prop.table(n), 2)
    return(list(cbind("", names(n), paste0(n, " ", "(", pro*100, ")", " ", "/", sum(!is.na(x))))))
  }
}

Because I want the output laid out in a specific way, I wrote another function instead of using the apply family. This function is:

tabsum <- function(table){
  out <- data.frame()
  for(col in colnames(table)){
    # out <- rbind(out, list(col, "", ""))
    out[nrow(out) + 1, 1] <- col
    for(row in sumfun(table[, col])){
      if(is.numeric(table[,col])){
        out[nrow(out), 2:3] <- row
      } else {
        out[nrow(out), 2:3] <- c("", "")
        out <- rbind(out, as.data.frame(row))
      }
    }
  }
  colnames(out) <- c("Variable", "Levels", "Mean (SD) or N (%)")
  return(out)
}

Using these two functions on the data set above produces:

tabsum(data)

[Output table lost in extraction; only the labels Variable, weight and smoke survive.]

Here is your code, updated so that factor fields contribute only the factor level "Yes":

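For the sumfun function:

- I dropped the else statement entirely.
- I added another if statement, because whether a factor is included now depends on it containing the factor level "Yes".
  - In this if statement you use the function table. You also used table as an object name in tabsum: I encourage you to use a different object name, so there is no confusion.
  - I changed the call to table so that it is conditional and drops empty levels.
  - I changed the return statement quite a bit.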

sumfun <- function(x){
  if(is.numeric(x)){
    m = round(mean(x, na.rm = TRUE), digits = 2)
    s = round(sd(x, na.rm = TRUE), digits = 2)
    return(list(cbind("", paste0(m, " ", "(", s, ")", " ", "/", " ", sum(!is.na(x))))))
  }
  if(is.factor(x)){
    # only factors that contain "Yes" return anything
    if(length(x[x == "Yes"]) > 0){
      n = table(droplevels(x[x == "Yes"]))
      return(list(cbind("", "Yes", paste0(n, " ",
                                          "(" , n/sum(!is.na(x)) * 100, ")",
                                          " ", "/ ", sum(!is.na(x))))))
    }
  }
}
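
To see why droplevels is part of the call to table above, here is a small sketch on a made-up factor (the vector x and the result names are assumptions for illustration only). Subsetting a factor keeps all of its levels, so without droplevels the table would still carry zero counts for "No" and "Unknown".

```r
# Hypothetical factor, used only to illustrate the behaviour
x <- factor(c("Yes", "Yes", "Yes", "No", "No", "Unknown"),
            levels = c("Yes", "No", "Unknown"))

# Subsetting keeps every level, so unused levels appear with count 0
with_unused <- table(x[x == "Yes"])

# droplevels() removes the unused levels before counting
yes_only <- table(droplevels(x[x == "Yes"]))

names(with_unused)  # all three levels are still present
names(yes_only)     # only "Yes" remains
```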

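In the tabsum function:

- I removed out[nrow(out) + 1, 1] <- col from the top of the column loop, because otherwise every column would appear in the table. Whether a column is included is now conditional.
- I added out[nrow(out) + 1, 1] <- col inside the row loop, so that only the names of variables for which sumfun returns a value are included.
- I changed the else statement to an if statement with its own condition.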

tabsum <- function(table){
  out <- data.frame()
  for(col in colnames(table)){
    for(row in sumfun(table[, col])){
      out[nrow(out) + 1, 1] <- col # moved into the row loop, so inclusion of col is conditional
      if(is.numeric(table[,col])){
        out[nrow(out), 2:3] <- row
      }
      if(is.factor(table[,col])){
        out[nrow(out), 2:3] <- c("", "")
        out <- rbind(out, as.data.frame(row))
      }
    }
  }
  colnames(out) <- c("Variable", "Levels", "Mean (SD) or N (%)")
  return(out)
}
tabsum(data)
#   Variable Levels Mean (SD) or N (%)
# 1   Height        174.2 (12.05) / 10
# 2   weight         71.3 (12.47) / 10
# 3    smoke                          
# 4             Yes        3 (30) / 10
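
One caveat about the length(x[x == "Yes"]) > 0 test in sumfun: logical indexing keeps NA entries, so a factor that has missing values but no "Yes" still passes the test. A minimal sketch on a made-up factor follows (x and the two result names are assumptions for illustration); counting with sum and na.rm = TRUE is one possible alternative, not necessarily the only one.

```r
# Hypothetical factor with a missing value and no "Yes" level
x <- factor(c("No", NA, "No"))

# The comparison yields FALSE, NA, FALSE; indexing by NA keeps one NA element
n_by_length <- length(x[x == "Yes"])

# Summing the comparison with na.rm = TRUE ignores the missing value
n_by_sum <- sum(x == "Yes", na.rm = TRUE)

n_by_length  # greater than zero even though no "Yes" is present
n_by_sum     # zero
```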

Now that only smoke == 'Yes' is included, you can put the information on a single row.

To do that, in sumfun, change the final return call by dropping the initial "":

return(list(cbind("Yes", paste0(n, " ",
                                "(" , n/sum(!is.na(x)) * 100, ")",
                                " ", "/ ", sum(!is.na(x))))))

In tabsum, change the if(is.factor... block so that it makes a single assignment to out, as follows:

if(is.factor(table[,col])){
  out[nrow(out), 1:3] <- c(col, row)
}

This changes the table to the following:

tabsum(data)
#   Variable Levels Mean (SD) or N (%)
# 1   Height        174.2 (12.05) / 10
# 2   weight         71.3 (12.47) / 10
# 3    smoke    Yes        3 (30) / 10
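
The question also asked for the level filter to be an input argument. The code above hard-codes "Yes", so here is a minimal sketch of one way to parameterise it. The names sumfun2, tabsum2, only and df are hypothetical, and the layout is simplified: every row carries its variable name. Unlike the answer above, a factor that lacks the requested level is summarised in full rather than omitted, which follows the wording of the question.

```r
# Hypothetical variant of sumfun(); `only` is an assumed argument name
sumfun2 <- function(x, only = NULL) {
  n_obs <- sum(!is.na(x))
  if (is.numeric(x)) {
    m <- round(mean(x, na.rm = TRUE), 2)
    s <- round(sd(x, na.rm = TRUE), 2)
    return(data.frame(Levels = "",
                      Summary = paste0(m, " (", s, ") / ", n_obs),
                      stringsAsFactors = FALSE))
  }
  counts <- table(as.factor(x))  # counts per level, NAs excluded
  if (!is.null(only)) {
    keep <- intersect(only, names(counts))
    # Restrict to the requested level(s) only when they exist
    if (length(keep) > 0) counts <- counts[keep]
  }
  pct <- round(100 * as.vector(counts) / n_obs, 0)
  data.frame(Levels = names(counts),
             Summary = paste0(as.vector(counts), " (", pct, ") / ", n_obs),
             stringsAsFactors = FALSE)
}

# Hypothetical variant of tabsum(); the argument is called df, not table,
# so it does not share a name with the table() function
tabsum2 <- function(df, only = NULL) {
  pieces <- lapply(names(df), function(col) {
    data.frame(Variable = col, sumfun2(df[[col]], only = only),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  colnames(out) <- c("Variable", "Levels", "Mean (SD) or N (%)")
  out
}

# Small made-up data set for a quick check
demo <- data.frame(
  Height = c(170, 180, 175, 165),
  smoke  = factor(c("Yes", "No", "Yes", "Unknown"),
                  levels = c("Yes", "No", "Unknown"))
)
res_yes <- tabsum2(demo, only = "Yes")  # smoke reduced to its "Yes" row
res_all <- tabsum2(demo)                # every level of smoke listed
```

Leaving only at its default of NULL reproduces the per-level counts of the original sumfun, so the same pair of functions covers both behaviours.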
