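I'm working through a problem with this dataset. I'm trying to build a model that predicts Japanese sales from all the other predictors (except rank, name, and global sales, which are too strongly correlated with the outcome variable). So, I did: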
library(tidyverse)
library(tidymodels)

vgames <- read_csv('data/vgsales.csv', show_col_types = FALSE, col_types = list(
  Year = col_date("%Y")
)) %>%
  mutate(
    Platform = factor(Platform),
    Genre = factor(Genre),
    Publisher = factor(Publisher)
  )

vgames_model <- vgames %>%
  select(-c(Rank, Name, Global_Sales))

# Train test split
vgames_split <- vgames_model %>% initial_split()
vgames_training <- vgames_split %>% training()
vgames_testing <- vgames_split %>% testing()

# Folds for CV
vgames_folds <- vgames_training %>% vfold_cv(v = 10)

# Recipe
vgames_recipe <- vgames_training %>%
  recipe(formula = JP_Sales ~ .) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_date(Year, features = c("year"), keep_original_cols = FALSE) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric_predictors())
The output of this recipe looks like this:
# A tibble: 12,448 × 570
NA_Sales EU_Sales Other_…¹ JP_Sa…² Year_…³ Platf…⁴ Platf…⁵ Platf…⁶ Platf…⁷ Platf…⁸ Platf…⁹ Platf…˟ Platf…˟ Platf…˟
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.272 -0.279 -0.240 0 2006 0 0 0 0 0 0 0 0 0
2 0.145 0.258 0.0629 0 2012 0 0 0 0 0 0 0 0 0
3 -0.198 -0.241 -0.189 0.07 2008 0 0 0 1 0 0 0 0 0
4 -0.149 -0.260 -0.189 0 2010 0 0 0 1 0 0 0 0 0
5 -0.149 -0.0679 -0.0380 0 2006 0 0 0 0 0 0 0 0 0
6 -0.296 -0.183 -0.189 0 2015 0 1 0 0 0 0 0 0 0
7 3.32 1.05 0.315 1.81 1988 0 0 0 0 0 0 0 0 0
8 -0.308 -0.260 -0.240 0 2016 0 0 0 0 0 0 0 0 0
9 -0.321 -0.202 -0.240 0 2015 0 0 0 0 0 0 0 0 0
10 -0.112 -0.145 -0.139 0 2010 0 0 0 0 0 0 0 0 0
# … with 12,438 more rows, 556 more variables: Platform_N64 <dbl>, Platform_NES <dbl>, Platform_NG <dbl>,
# Platform_PC <dbl>, Platform_PCFX <dbl>, Platform_PS <dbl>, Platform_PS2 <dbl>, Platform_PS3 <dbl>,
# Platform_PS4 <dbl>, Platform_PSP <dbl>, Platform_PSV <dbl>, Platform_SAT <dbl>, Platform_SCD <dbl>,
# Platform_SNES <dbl>, Platform_TG16 <dbl>, Platform_Wii <dbl>, Platform_WiiU <dbl>, Platform_WS <dbl>,
# Platform_X360 <dbl>, Platform_XB <dbl>, Platform_XOne <dbl>, Genre_Adventure <dbl>, Genre_Fighting <dbl>,
# Genre_Misc <dbl>, Genre_Platform <dbl>, Genre_Puzzle <dbl>, Genre_Racing <dbl>, Genre_Role.Playing <dbl>,
# Genre_Shooter <dbl>, Genre_Simulation <dbl>, Genre_Sports <dbl>, Genre_Strategy <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Now, here's the problem: when I define and fit the mlp, every epoch reports nan for the loss and the other metrics, i.e.:
nn <- mlp(epochs = 20) %>%
  set_engine(
    'keras',
    verbose = 1,
    metrics = c("mae"),
    optimizer = 'adam',
    loss = 'mean_absolute_error'
  ) %>%
  set_mode('regression')

nnwf <- workflow() %>%
  add_model(nn) %>%
  add_recipe(vgames_recipe)

nnwf %>% fit(vgames_training)
which produces
...
Epoch 16/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 17/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 18/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 19/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 20/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
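I've looked around and tried normalizing in other ways, lowering the learning rate (both in the mlp() function and in the set_engine specification), and removing the date column entirely. None of this worked, and I'm having a hard time figuring out what is wrong. Has anyone run into this problem before?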
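The original Year column contains missing data, and missing data produces missing statistics: any NA that reaches the network as an input shows up as a nan loss.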
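
To make that concrete, here is a minimal base R sketch of how one might locate missing values and see how they propagate. The `toy` data frame and its values are made up for illustration and only stand in for the real vgsales data; dropping or imputing rows are two common options, not necessarily the right ones for this dataset. Inside a recipe, steps such as `step_naomit()` or `step_impute_median()` would likely play the same role, but that is an assumption and has not been run against this data.

```r
# Toy stand-in for the vgsales data; the values are invented for illustration.
toy <- data.frame(
  Year     = c(2006, NA, 2008, 2010),
  NA_Sales = c(0.5, 1.2, 0.3, 0.8),
  JP_Sales = c(0.00, 0.07, 1.81, 0.00)
)

# Count missing values per column to find which predictors are affected.
na_counts <- colSums(is.na(toy))
print(na_counts)

# A single NA makes any summary over the column missing as well,
# which is the same mechanism that turns the loss into nan.
print(mean(toy$Year))                # NA
print(mean(toy$Year, na.rm = TRUE))  # 2008

# Option 1: drop incomplete rows before modelling.
toy_complete <- toy[complete.cases(toy), ]

# Option 2: impute the missing value (the median is one simple choice).
toy_imputed <- toy
toy_imputed$Year[is.na(toy_imputed$Year)] <- median(toy$Year, na.rm = TRUE)

print(nrow(toy_complete))
print(anyNA(toy_imputed))
```

Running `colSums(is.na(vgames_training))` on the real training set would show whether Year is the only column affected or whether other predictors also contain missing values.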