Correctly using month as a factor in a linear model in R


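I am trying to model minimum temperature from maximum temperature, precipitation, and month. I know there are many questions about using factors in linear models, but honestly none of them seems to answer mine. The way R handles and uses dummy variables confuses me. A small sample of my data follows, then my code: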

data <- structure(list(month = c(5, 6, 9, 8, 9, 9, 10, 10, 1, 3, 6, 4, 
11, 1, 3, 12, 8, 5, 12, 3, 10, 12, 9, 1, 1, 10, 12, 4, 7, 7, 
11, 8, 10, 3, 7, 1, 3, 9, 10, 11, 5, 1, 7, 10, 9, 11, 7, 4, 6, 
12, 10, 11, 11, 7, 5, 7, 5, 1, 6, 6, 5, 1, 1, 5, 5, 11, 12, 6, 
10, 6, 2, 6, 4, 11, 9, 6, 11, 3, 8, 12, 6, 2, 6, 3, 10, 9, 4, 
4, 5, 11, 11, 11, 1, 8, 4, 4, 10, 12, 9, 8), tmax = c(54, 84, 
74, 82, 63, 87, 68, 59, -4, 17, 69, 42, 46, 29, 38, 42, 95, 67, 
22, 48, 50, 34, 74, 40, 1, 71, 49, 32, 89, 74, 56, 92, 69, 23, 
86, 49, 47, 84, 48, 73, 62, 8, 83, 60, 69, 17, 90, 69, 77, 37, 
55, 43, 38, 93, 52, 84, 73, 35, 75, 83, 53, 33, 33, 81, 68, 55, 
31, 98, 72, 80, 13, 85, 71, 48, 68, 85, 53, 48, 92, 4, 61, 34, 
89, 62, 50, 62, 73, 63, 63, 33, 31, 57, 7, 72, 45, 64, 63, 31, 
65, 85), tmin = c(0.04, 0.21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0.01, 0, 0, 0, 0.14, 0.18, NA, 0.13, 0, 0.15, NA, 0.02, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0.38, 0, 0, 0, 0.01, 0, 0.42, 
NA, 0, NA, 0, NA, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0.84, 0.03, 0, 
0, 0, 0, 0, 0, 0, 0.01, 0, NA, 0.26, 0, 0, 0, 0.32, 0, 0, 0, 
0, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0, NA, 
0.02, 0), precip = c(0.04, 0.21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0.01, 0, 0, 0, 0.14, 0.18, NA, 0.13, 0, 0.15, NA, 0.02, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0.38, 0, 0, 0, 0.01, 0, 
0.42, NA, 0, NA, 0, NA, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0.84, 0.03, 
0, 0, 0, 0, 0, 0, 0, 0.01, 0, NA, 0.26, 0, 0, 0, 0.32, 0, 0, 
0, 0, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0, 
NA, 0.02, 0)), row.names = c(11604L, 32822L, 32919L, 35089L, 
40958L, 3690L, 34052L, 19787L, 26818L, 14839L, 21143L, 32761L, 
14364L, 14043L, 30552L, 30077L, 5846L, 2486L, 25352L, 13369L, 
21268L, 6355L, 16844L, 26847L, 35593L, 20523L, 10359L, 9379L, 
6200L, 26647L, 23129L, 19388L, 38057L, 12637L, 42724L, 15875L, 
1314L, 7352L, 34397L, 12146L, 27310L, 20622L, 8026L, 12121L, 
26709L, 7409L, 1091L, 11587L, 23699L, 31917L, 14328L, 19458L, 
10322L, 351L, 43747L, 23350L, 31329L, 8939L, 42693L, 34279L, 
18541L, 25011L, 37791L, 17834L, 2845L, 12519L, 19848L, 3978L, 
5907L, 28075L, 15177L, 3616L, 32037L, 9955L, 1498L, 17858L, 10700L, 
27624L, 4768L, 24624L, 20036L, 5683L, 43408L, 37485L, 21255L, 
15747L, 15234L, 7933L, 27690L, 24227L, 17286L, 30781L, 2358L, 
9885L, 28380L, 35327L, 8851L, 14743L, 37314L, 8057L), class = "data.frame")

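With the following code, January is missing from the output (the output below comes from the full data set of 42,000 rows). Does that mean the intercept represents January?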

tmin_model <- lm(data$tmin ~ data$tmax + data$precip + as.factor(data$month))
summary(tmin_model)
Call:
lm(formula = data$tmin ~ data$tmax + data$precip + as.factor(data$month))
Residuals:
    Min      1Q  Median      3Q     Max 
-41.663  -4.827   0.182   5.110  22.489 
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -13.524700   0.148019 -91.371  < 2e-16 ***
data$tmax                 0.674834   0.003098 217.837  < 2e-16 ***
data$precip               6.671204   0.164683  40.509  < 2e-16 ***
as.factor(data$month)2    1.090986   0.187072   5.832 5.52e-09 ***
as.factor(data$month)3    5.868886   0.189904  30.904  < 2e-16 ***
as.factor(data$month)4    7.325417   0.209629  34.945  < 2e-16 ***
as.factor(data$month)5   10.453276   0.230197  45.410  < 2e-16 ***
as.factor(data$month)6   14.364899   0.250073  57.443  < 2e-16 ***
as.factor(data$month)7   15.382325   0.260707  59.002  < 2e-16 ***
as.factor(data$month)8   14.269489   0.256420  55.649  < 2e-16 ***
as.factor(data$month)9   10.729316   0.238739  44.942  < 2e-16 ***
as.factor(data$month)10   7.209093   0.214178  33.659  < 2e-16 ***
as.factor(data$month)11   5.950449   0.192669  30.884  < 2e-16 ***
as.factor(data$month)12   2.752499   0.183948  14.963  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.286 on 39784 degrees of freedom
(4411 observations deleted due to missingness)
Multiple R-squared:  0.8929,    Adjusted R-squared:  0.8929 
F-statistic: 2.553e+04 on 13 and 39784 DF,  p-value: < 2.2e-16

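Do I need to create a dummy variable for each month to do this correctly? Also, how do I run predict on just a few data points? I always get all 42,000 rows back when all I want is the model's prediction for a few points. For example, for a single point in January, why does the following code return 42,000 rows?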

predict.lm(tmin_model, newdata = data.frame(tmax = rnorm(1, 20, 13), month = 1, precip = 0, tmin = NA))

Thanks.

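Build the model, converting month to a factor first and passing the data frame through the data argument: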

data$month <- factor(data$month)
tmin_model <- lm(tmin ~ tmax + precip + month, data = data)
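
On the dummy-variable question: no hand-made dummies should be needed. As far as I know, R expands a factor into 0/1 indicator columns by itself and, under the default treatment contrasts, drops the first level (here month 1) as the reference. The sketch below illustrates this on a made-up data frame; `toy` and `month_jul` are hypothetical names, not part of the data above.

```r
# Toy data (hypothetical, not the weather data above): 3 rows per month
toy <- data.frame(month = factor(rep(1:12, each = 3)))

# lm() builds its design matrix the same way model.matrix() does
mm <- model.matrix(~ month, data = toy)

# An intercept column plus month2 ... month12; there is no month1 column
print(colnames(mm))

# The first factor level is the reference absorbed into the intercept
print(levels(toy$month)[1])

# To compare against a different baseline month, relevel before fitting
toy$month_jul <- relevel(toy$month, ref = "7")
mm_jul <- model.matrix(~ month_jul, data = toy)
print(colnames(mm_jul))
```

So in the output from the question, the intercept is the fitted tmin for January at tmax = 0 and precip = 0, and each month coefficient is the shift relative to January. With those printed coefficients, a July day with tmax = 80 and precip = 0 would be predicted at about -13.5247 + 0.674834 * 80 + 15.382325 = 55.84.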

predict now returns only one row:

predict.lm(tmin_model, newdata =
data.frame(tmax = rnorm(1, 20, 13), month = factor(1), precip = 0, tmin = NA))
1 
-7.905385e-18 
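
Why the original call returned every row: when the formula is written with data$tmax and so on, predict looks for variables with exactly those names, does not find them in newdata, and falls back to the original data frame. The sketch below reproduces this on simulated data; `toy`, `bad`, `good`, and `new_points` are hypothetical names and the numbers are made up, so treat it as an illustration rather than a reanalysis of the data above.

```r
set.seed(1)

# Toy data (hypothetical): 10 observations for each of 12 months
toy <- data.frame(
  month  = factor(rep(1:12, times = 10)),
  tmax   = rnorm(120, mean = 50, sd = 20),
  precip = runif(120, min = 0, max = 0.5)
)
toy$tmin <- 0.7 * toy$tmax - 10 + rnorm(120, sd = 5)

# Terms written as toy$... cannot be matched against columns of newdata
bad <- lm(toy$tmin ~ toy$tmax + toy$precip + toy$month)

# Bare column names plus data = let predict() substitute newdata
good <- lm(tmin ~ tmax + precip + month, data = toy)

# Three new points; month gets the same levels as in the fitted model
new_points <- data.frame(
  tmax   = c(20, 35, 80),
  precip = c(0, 0.1, 0),
  month  = factor(c(1, 4, 7), levels = levels(toy$month))
)

# newdata is effectively ignored here; R also warns about the row mismatch
n_bad <- length(suppressWarnings(predict(bad, newdata = new_points)))

# One prediction per row of newdata
n_good <- length(predict(good, newdata = new_points))

print(c(bad = n_bad, good = n_good))   # bad = 120, good = 3
```

The response column is not needed in newdata, so the tmin = NA entry in the calls above can be dropped. Giving the new month values the model's factor levels explicitly is a defensive choice that keeps the prediction data consistent with the fitted factor.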
