Adding a constant to meaningful 0 values and imputing missing data in R before running PCA with prcomp()



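I am currently running a statistical analysis on data that contains meaningful 0 values as well as columns with genuinely missing values (i.e. NAs that carry no meaning), and I would like some help.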

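  1. I want to run a principal component analysis on a dataset that contains many 0 values which are not missing data (they are real temperature readings). The goal is to cluster the data according to how temperature varies across locations and seasons. Since, as I understand it, the prcomp() function treats 0 values as missing values in R, I am wondering what would stop me from adding a constant (e.g. 1) to the whole dataset. That way the 0 values become 1, and the same constant is added to every numeric variable in the dataset. By doing this, I assume I keep the original variation in the data without technically preventing R from running the PCA I want. However, I am not very confident in this approach, so I would like to ask whether there is any reason not to do it.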
# Create a reproducible dataset
my_df <- data.frame(
  Location = rep(LETTERS[1:6], 1000/2),
  Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2)*10,
  RISK_MM = round(rnorm(1000), digits = 2)*100,
  Pressure = round(rnorm(1000), digits = 2)*1000,
  Sunshine = round(rnorm(1000), digits = 2))
# Add "0" values and NAs to my dataset in regards of specific categorical variables
my_df <- my_df %>%
  mutate(
    Temperatures = case_when(
      str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
      TRUE ~ as.numeric(as.character(Temperatures))))
# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)
# Print the results
print(my_df)
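  2. Some columns also contain missing values, which I think I should impute with the mice() function, grouped by the categorical independent variables I am interested in, before running the PCA, for example: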
# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>%
  mice(m = 5, maxit = 50, method = "cart", seed = 123)
# Create a dataset after imputation
completeImputedData <- complete(imp, 1)
# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(is.numeric)
# Add a constant
completeImputedData_num_cs <- completeImputedData_num + 1
# Dimension reduction using PCA and scale the data.
my_pca <- prcomp(completeImputedData_num_cs,  scale = TRUE, center = TRUE)
# Keep going...

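Do you think these approaches fit my needs? Or would you advise me to look into a different clustering method or another way of imputing the data?

Thank you for your attention, and have a nice day.

Best regards, Philippe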

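I don't think your zero values will affect your analysis. Let me explain. When you use PCA it is important to scale the data, which you have already done. Have you noticed what that does to your data?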
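As a quick check that prcomp() handles 0 as an ordinary number and only stops on a real NA, here is a minimal base-R sketch. The matrices `x` and `x_na` and their values are made up for illustration and are not taken from your dataset.

```r
# Toy data (hypothetical): 6 rows, 2 columns, with several genuine zeros
x <- cbind(temp  = c(0, 0, 5, -3, 2, 0),
           press = c(10, 0, 7, 3, -4, 1))

# Zeros are ordinary values: prcomp() runs on them like on any other number
pca_zero <- prcomp(x, center = TRUE, scale. = TRUE)
print(pca_zero$sdev)

# A real missing value (NA) is what prcomp() cannot handle
x_na <- x
x_na[1, 1] <- NA
res_na <- tryCatch(prcomp(x_na, center = TRUE, scale. = TRUE),
                   error = function(e) e)
print(inherits(res_na, "error"))   # reports whether the NA caused an error
```

So only the NAs need treatment (for example imputation) before the PCA; the zeros can stay as they are.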

Starting from your code (I added the library calls and set.seed()):

library(tidyverse)
library(mice)
# Create a reproducible dataset
set.seed(22123)                    # I added this to make this reproducible
my_df <- data.frame(
  Location = rep(LETTERS[1:6], 1000/2),
  Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2)*10,
  RISK_MM = round(rnorm(1000), digits = 2)*100,
  Pressure = round(rnorm(1000), digits = 2)*1000,
  Sunshine = round(rnorm(1000), digits = 2))
# Add "0" values and NAs to my dataset in regards of specific categorical variables
my_df <- my_df %>%
  mutate(
    Temperatures = case_when(
      str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
      TRUE ~ as.numeric(as.character(Temperatures))))
# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)
summary(my_df)
funModeling::df_status(my_df)
head(my_df)
# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>%
  mice(m = 5, maxit = 50, method = "cart", seed = 123,
       printFlag = FALSE)                             # I added this
# Create a dataset after imputation
completeImputedData <- complete(imp, 1)
# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(where(is.numeric))                 # I added where()
# Add a constant
completeImputedData_num_cs <- completeImputedData_num + 1
# Dimension reduction using PCA and scale the data.
(my_pca <- prcomp(completeImputedData_num_cs,
                  scale = TRUE,
                  center = TRUE))         # I added the encapsulating parentheses
                                          # (print and create object simultaneously)
# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
# 
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944

But what if you scale and center outside the PCA call?

df <- scale(completeImputedData_num_cs)
(my_pca <- prcomp(df,
                  scale = FALSE,   # added for clarity only
                  center = FALSE))
# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
# 
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944

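The PCA results are identical, and the scaled data contains no zeros: once you scale the data, a 0 is no longer 0. Adding 1 beforehand makes no difference either, because centering subtracts each column's mean and the shift cancels out. Take a look: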

funModeling::df_status(completeImputedData_num_cs)
#       variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 Temperatures      12     0.4    0    0     0     0 numeric    383
# 2      RISK_MM      21     0.7    0    0     0     0 numeric    378
# 3     Pressure       0     0.0    0    0     0     0 numeric    376
# 4     Sunshine       9     0.3    0    0     0     0 numeric    364 
funModeling::df_status(df)
#           variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 var.Temperatures       0       0    0    0     0     0 numeric    383
# 2      var.RISK_MM       0       0    0    0     0     0 numeric    378
# 3     var.Pressure       0       0    0    0     0     0 numeric    376
# 4     var.Sunshine       0       0    0    0     0     0 numeric    364 
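
To see the same effect in isolation, here is a minimal base-R sketch on made-up data. The matrix `x`, its column names and the seed are illustrative assumptions, not part of your dataset. It compares a PCA on the raw values with a PCA on the values shifted by 1, and counts exact zeros before and after scale().

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
# Toy numeric data (hypothetical), rounded so that exact zeros can occur
x <- matrix(round(rnorm(200), 1), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))

pca_raw     <- prcomp(x,     center = TRUE, scale. = TRUE)
pca_shifted <- prcomp(x + 1, center = TRUE, scale. = TRUE)

# Centering subtracts each column mean, so the "+ 1" cancels out
print(all.equal(pca_raw$sdev, pca_shifted$sdev))
# abs() guards against a possible sign flip of a component
print(all.equal(abs(pca_raw$rotation), abs(pca_shifted$rotation)))

# After scale(), an original 0 becomes -mean/sd of its column
z <- scale(x)
print(sum(x == 0))   # exact zeros in the raw data
print(sum(z == 0))   # exact zeros after centering and scaling
```

If both comparisons agree, adding the constant before a centered and scaled PCA is harmless but also unnecessary.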
