R: linear regression for each category of a variable



Suppose I am working with the iris dataset in R:

data(iris)
summary(iris)
  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width           Species
 Min.   : 4.300   Min.   : 2.000   Min.   : 1.000   Min.   : 0.100   setosa    : 50
 1st Qu.: 5.100   1st Qu.: 2.800   1st Qu.: 1.600   1st Qu.: 0.300   versicolor: 50
 Median : 5.800   Median : 3.000   Median : 4.350   Median : 1.300   virginica : 50
 Mean   : 5.843   Mean   : 3.057   Mean   : 3.758   Mean   : 1.199
 3rd Qu.: 6.400   3rd Qu.: 3.300   3rd Qu.: 5.100   3rd Qu.: 1.800
 Max.   : 7.900   Max.   : 4.400   Max.   : 6.900   Max.   : 2.500

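I want to run a linear regression with Petal.Length as the dependent variable and Sepal.Length as the independent variable. In R, how can I run this regression for every Species category at once and get the p-value, R² and F value of each fit?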

Use by (the \(x) lambda shorthand requires R 4.1 or later):

by(iris, iris$Species, \(x) summary(lm(Petal.Length ~ Sepal.Length, x)))
# iris$Species: setosa
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.40856 -0.08027 -0.00856  0.11708  0.46512 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)  
# (Intercept)   0.80305    0.34388   2.335   0.0238 *
#   Sepal.Length  0.13163    0.06853   1.921   0.0607 .
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.1691 on 48 degrees of freedom
# Multiple R-squared:  0.07138, Adjusted R-squared:  0.05204 
# F-statistic:  3.69 on 1 and 48 DF,  p-value: 0.0607
# 
# --------------------------------------------------------- 
#   iris$Species: versicolor
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.68611 -0.22827 -0.04123  0.19458  0.79607 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.18512    0.51421   0.360     0.72    
# Sepal.Length  0.68647    0.08631   7.954 2.59e-10 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.3118 on 48 degrees of freedom
# Multiple R-squared:  0.5686,  Adjusted R-squared:  0.5596 
# F-statistic: 63.26 on 1 and 48 DF,  p-value: 2.586e-10
# 
# --------------------------------------------------------- 
#   iris$Species: virginica
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.68603 -0.21104  0.06399  0.18901  0.66402 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.61047    0.41711   1.464     0.15    
# Sepal.Length  0.75008    0.06303  11.901  6.3e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2805 on 48 degrees of freedom
# Multiple R-squared:  0.7469,  Adjusted R-squared:  0.7416 
# F-statistic: 141.6 on 1 and 48 DF,  p-value: 6.298e-16

Edit

To elaborate on my comment, we can easily extract the desired values:

by(iris, iris$Species, \(x) lm(Petal.Length ~ Sepal.Length, x)) |>
  lapply(\(x) {
    with(summary(x), c(r2=r.squared, f=fstatistic,
                       p=do.call(pf, c(as.list(unname(fstatistic)), lower.tail=FALSE))))
  }) |> do.call(what=rbind)
#                    r2    f.value f.numdf f.dendf            p
# setosa     0.07138289   3.689765       1      48 6.069778e-02
# versicolor 0.56858983  63.263024       1      48 2.586190e-10
# virginica  0.74688439 141.636664       1      48 6.297786e-16
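
The p column above is computed by hand because summary.lm stores the F statistic but not the overall p-value, which is only calculated when the summary is printed. As a minimal sketch of that step for a single species (the variable names fs, p_overall and p_slope are illustrative, not from the answer):

```r
# Fit the model for one species (setosa) to check the p-value by hand.
fit <- lm(Petal.Length ~ Sepal.Length, data = subset(iris, Species == "setosa"))
s <- summary(fit)

# fstatistic is a named vector with elements value, numdf and dendf.
fs <- s$fstatistic

# The upper tail of the F distribution gives the overall model p-value.
p_overall <- pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE)

# With a single predictor, this matches the p-value of the slope's t-test.
p_slope <- coef(s)["Sepal.Length", "Pr(>|t|)"]
all.equal(p_overall, p_slope)
```

With more than one predictor the two would differ: the F test covers all slopes jointly, while each coefficient row tests one term.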

If you want to extract these values, we can use:

library(dplyr)

df <- iris
list_res <- df %>%
  base::split(., df$Species, drop = FALSE) %>%
  lapply(., function(x) {
    # Fit the model on one species and keep only its summary
    fit <- lm(Petal.Length ~ Sepal.Length, data = x) %>%
      summary()
    r <- fit$r.squared
    # rownames = "term" keeps the coefficient names, which as_tibble() drops by default
    coeffs <- fit$coefficients %>%
      as_tibble(rownames = "term")
    # The first element of fstatistic is the F value itself
    f <- fit$fstatistic[[1]]
    res <- list(r, coeffs, f)
    names(res) <- c("R-Squared", "Coefficients", "F-Value")
    return(res)
  })

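For each regression model this returns a list of three objects containing the desired values. I left the coefficient table in because it is always good to know which independent variable a p-value belongs to. If you would rather extract those p-values on their own, you can use, for example, coeffs <- fit$coefficients[, 4] %>% as.list()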
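
If a single table is more convenient than a nested list, the same values can be collected into one row per species. The following is a sketch in base R only, not the answer's own method; the column names (term, estimate, p_value, r_squared, f_value) are chosen here for illustration:

```r
# Split iris by species and fit the same model within each group.
fits <- lapply(split(iris, iris$Species),
               function(d) lm(Petal.Length ~ Sepal.Length, data = d))

# From each fit, take the slope row of the coefficient table plus R-squared and F.
rows <- lapply(fits, function(fit) {
  s <- summary(fit)
  slope <- coef(s)["Sepal.Length", ]
  data.frame(
    term      = "Sepal.Length",
    estimate  = slope[["Estimate"]],
    p_value   = slope[["Pr(>|t|)"]],
    r_squared = s$r.squared,
    f_value   = s$fstatistic[["value"]]
  )
})

# Stack the rows and carry the species names along as a regular column.
res <- data.frame(Species = names(rows), do.call(rbind, rows), row.names = NULL)
res
```

Keeping the term column next to p_value makes it explicit which coefficient each p-value refers to, which matters once the model has more than one predictor.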
