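I'm running a linear regression in R to see whether any of my variables have a significant relationship with the outcome Z. None of my variables seemed to be significantly related to this outcome... until I added a variable called "binary". Suddenly, many of the variables became highly significant. My question is: why would adding a single variable change the output so drastically?
The data frame I'm using is below: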
sample <- data.frame(
Z = c(-0.5, 0.5, 0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5,
-0.5, 0.5, -0.5, -0.5, -0.5, 0.5, 0.5, 0.5, 0.5, -0.5, -0.5, 0.5),
v1 = c(23, 25, 42, 52, 38, 34, 57, 48, 29, 49,
31, 45, 31, 30, 29, 28, 41, 45, NA, NA, 31),
v2 = c("No", "Yes", "No", "No", "No", "No","No", "Yes", "No", "No", "Yes",
"Yes", "No", "No", "No", "No",
"No", "No", "No", "No", "No"),
v3 = c("No", "Yes", "No", "No", "No", "No", "No", "Yes", "No",
"No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No",
"No", "No", "No"),
mar_status.factor = c(NA, NA, "Never Married", "Married",
"Never Married", "Never Married", "Never Married", "Married",
"Never Married", "Never Married", "Never Married", NA,
"Never Married", "Never Married", "Never Married", "Never Married",
"Never Married", "Separated", NA, NA, "Never Married"),
highest_ed.factor = c(NA, NA, "Did not complete high school", "Associates Degree",
"Regular high school diploma", "Some college credit, but less than 1 year",
"GED or equivalent", "Some college credit, but less than 1 year",
"Regular high school diploma", "Did not complete high school",
"Did not complete high school", NA, "Bachelors Degree",
"Did not complete high school", "Did not complete high school",
"Did not complete high school", "Bachelors Degree",
"GED or equivalent", NA, NA, "Did not complete high school"),
v4 = c(NA, NA, 3, 3, 3, NA, 2, 3, 5, 2, 1, NA, 3, 2,
1, 3, 3, 1, NA, NA, 1),
v5= c(NA, NA, 27600, 15000, 1400, NA, 600, 10800, NA, 12000, NA, NA, 9000, 3000,
2100, 13000, 60000, 10000, NA, NA, 0),
binary = c(NA, NA, 1, 1, 1, NA, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 1, 1, NA, NA, 1))
When I run the model below in R, I get the following output, in which nothing is significant.
Call:
lm(formula = Z ~ v1 + v2 + v3 + mar_status.factor + highest_ed.factor +
v4 + v5, data = sample)
Residuals:
3 4 5 7 8 10 13 14 15 16 17 18
2.682e-01 1.596e-16 9.714e-17 -3.469e-17 6.939e-18 -1.040e-01 1.162e-01 -6.675e-01 1.162e-01 1.175e-01 -1.162e-01 2.082e-17
21
2.696e-01
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.986e+00 2.662e+00 1.122 0.379
v1 -4.539e-02 3.908e-02 -1.161 0.365
v2Yes -3.502e-02 1.207e+00 -0.029 0.979
v3Yes -7.087e-03 8.727e-01 -0.008 0.994
mar_status.factorNever Married -1.184e+00 9.049e-01 -1.308 0.321
mar_status.factorSeparated -1.249e+00 1.656e+00 -0.754 0.530
highest_ed.factorBachelors Degree -6.862e-01 9.950e-01 -0.690 0.562
highest_ed.factorDid not complete high school 4.343e-02 8.932e-01 0.049 0.966
highest_ed.factorGED or equivalent 6.811e-01 1.085e+00 0.628 0.594
highest_ed.factorRegular high school diploma NA NA NA NA
highest_ed.factorSome college credit, but less than 1 year NA NA NA NA
v4 -2.079e-01 4.975e-01 -0.418 0.717
v5 3.320e-05 2.812e-05 1.181 0.359
Residual standard error: 0.5724 on 2 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.787, Adjusted R-squared: -0.2779
F-statistic: 0.739 on 10 and 2 DF, p-value: 0.6981
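However, when I add the variable called "binary", the output changes and tells me there is an "essentially perfect fit". Now, all of a sudden, several variables are highly significant!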
Call:
lm(formula = Z ~ v1 + v2 + v3 + mar_status.factor + highest_ed.factor +
v4 + v5 + binary, data = sample)
Residuals:
3 4 5 7 8 10 13 14 15 16 17 18
-1.414e-16 -8.628e-32 -4.314e-32 3.081e-32 -1.233e-32 8.539e-17 -2.853e-17 -5.686e-17 -2.853e-17 1.271e-16 2.853e-17 -6.163e-33
21
1.427e-17
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.000e-01 1.407e-15 -3.553e+14 1.79e-15 ***
v1 -3.208e-17 1.962e-17 -1.635e+00 0.349
v2Yes -1.000e+00 5.380e-16 -1.859e+15 3.42e-16 ***
v3Yes 1.000e+00 4.369e-16 2.289e+15 2.78e-16 ***
mar_status.factorNever Married -1.000e+00 3.546e-16 -2.820e+15 2.26e-16 ***
mar_status.factorSeparated -1.279e-15 7.279e-16 -1.758e+00 0.329
highest_ed.factorBachelors Degree -4.737e-16 4.294e-16 -1.103e+00 0.469
highest_ed.factorDid not complete high school 1.000e+00 4.346e-16 2.301e+15 2.77e-16 ***
highest_ed.factorGED or equivalent 3.286e-16 4.605e-16 7.140e-01 0.605
highest_ed.factorRegular high school diploma NA NA NA NA
highest_ed.factorSome college credit, but less than 1 year NA NA NA NA
v4 -3.218e-16 2.012e-16 -1.599e+00 0.356
v5 2.182e-20 1.421e-20 1.535e+00 0.368
binary 1.000e+00 2.743e-16 3.646e+15 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.22e-16 on 1 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.673e+30 on 11 and 1 DF, p-value: 3.275e-16
Warning message:
In summary.lm(lm(Z ~ v1 + v2 + v3 + mar_status.factor + highest_ed.factor + :
essentially perfect fit: summary may be unreliable
Why does adding this variable change the output so drastically?
m1 <- lm(formula = Z ~ v1 + v2 + v3 + mar_status.factor + highest_ed.factor +
v4 + v5, data = sample)
m2 <- update(m1, . ~ . + binary)
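You have 21 data points (nrow(sample)), but only 13 observations remain once the rows with a missing value in the response or in any predictor are dropped (nobs(m1)); R does a complete-case analysis by default. In the first model you have 11 independent parameters (length(na.omit(coef(m1)))), and in the second model you have 12. That leaves two residual degrees of freedom for model 1 (df.residual(m1)) and only one for model 2, so you go from a nearly perfect model to a perfect one.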
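The bookkeeping above can be reproduced on a much smaller example. This is only a sketch on hypothetical toy data (the data frame `toy` and its columns are made up for illustration and are not the data from the question); it shows how rows with missing values reduce the number of observations actually used, and how the residual degrees of freedom follow from that.

```r
# Hypothetical toy data (NOT the data from the question):
# 8 rows, of which rows 3, 4 and 5 each contain one NA.
toy <- data.frame(
  y  = c(1.2, 0.7, 2.9, 3.1, 4.8, 5.2, 6.1, 7.4),
  x1 = c(1, 2, 3, NA, 5, 6, 7, 8),
  x2 = c(0, 1, 0, 1, NA, 1, 0, 1),
  x3 = c(2.5, 1.0, NA, 3.5, 0.5, 4.0, 1.5, 3.0)
)

# With the default na.action (na.omit), lm() silently drops incomplete rows.
fit <- lm(y ~ x1 + x2 + x3, data = toy)

n_rows <- nrow(toy)                    # rows in the data frame
n_used <- nobs(fit)                    # rows actually used in the fit
n_coef <- length(na.omit(coef(fit)))   # coefficients that were estimated
res_df <- df.residual(fit)             # residual degrees of freedom

# Residual df = observations used minus estimated coefficients.
print(c(rows = n_rows, used = n_used, coefs = n_coef, resid_df = res_df))
```

Running the same four calls on m1 and m2 gives the counts quoted above.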
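You won't always get a perfect model this way (for a guaranteed perfect fit you would need 0 residual df, not 1), but since your response variable takes only two distinct values (-0.5 and 0.5), it is not surprising that you can fit the data perfectly with 12 coefficients for 13 observations...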
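To make the distinction between 0 and 1 residual df concrete, here is a minimal sketch on hypothetical data (the factor `g` and the responses are invented for illustration). A factor with one level per observation uses up every degree of freedom, so the fit is exact no matter what the response is; with one residual df the fit is exact only if the data happen to cooperate, which is easy when the response takes just two values.

```r
# Hypothetical two-valued response, as in the question (-0.5 / 0.5).
y <- c(-0.5, 0.5, 0.5, -0.5, 0.5, -0.5)

# Case 1: 6 observations, intercept + 5 dummies = 6 parameters, 0 residual df.
g0   <- factor(1:6)
fit0 <- lm(y ~ g0)

# Case 2: add a 7th observation in level 6 with a DIFFERENT response value.
# 1 residual df, and the fit is no longer perfect.
g1   <- factor(c(1:6, 6))
y1   <- c(y, 0.5)
fit1 <- lm(y1 ~ g1)

# Case 3: same design, but the 7th response MATCHES the other level-6 value.
# Still 1 residual df, yet the fit is perfect again.
y1b   <- c(y, -0.5)
fit1b <- lm(y1b ~ g1)

print(c(df0 = df.residual(fit0), df1 = df.residual(fit1), df1b = df.residual(fit1b)))
print(c(max0  = max(abs(residuals(fit0))),
        max1  = max(abs(residuals(fit1))),
        max1b = max(abs(residuals(fit1b)))))
```

With only two possible response values, the situation in case 3 is far more likely than it would be for a continuous response.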
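Note that R gave you a warning message
essentially perfect fit: summary may be unreliable
which is telling you precisely that the computations R uses break down in this case...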
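If you want to check for this situation in a script rather than by reading the console, one possible approach is to collect the warnings raised by summary() and to look at the size of the residuals relative to the response. This is a sketch on hypothetical data with an exact linear relationship (not the data from the question); whether the warning is raised in a given case depends on the internal threshold used by summary.lm(), so the residual check is the more direct test.

```r
# Hypothetical data with an exact linear relationship (NOT from the question).
x <- 1:10
y <- 2 * x + 1
fit <- lm(y ~ x)

# Collect any warnings raised by summary() instead of printing them.
msgs <- character(0)
s <- withCallingHandlers(
  summary(fit),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Independent check: largest residual relative to the scale of the response.
# A value near machine precision means the fit is numerically exact, and the
# standard errors, t values and p-values in the summary are meaningless.
rel_resid <- max(abs(residuals(fit))) / max(abs(y))

print(msgs)
print(rel_resid)
```

In the question's second model the residual standard error is about 2.22e-16, which is of the order of double-precision machine epsilon, so the huge t values and tiny p-values are rounding noise rather than evidence of a relationship.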