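I am currently running a statistical analysis on data that contain both meaningful 0 values and columns with genuinely missing values (i.e. uninformative NAs), and I would appreciate some help.
- I want to run a principal component analysis (PCA) on a dataset that contains many 0 values which are not missing data (they are temperatures). The goal is to cluster the data according to how temperature varies across locations and seasons. Since, as I understand it, the prcomp() function treats 0 values as missing values in R, I was wondering what would stop me from adding a constant (e.g. 1) to the whole dataset. That way the 0 values become 1, and the same constant is added to every numeric variable in the dataset. I assume this would preserve the original variation in the data without technically preventing R from running the PCA I want. As I am not very confident in this approach, I would like to ask whether there is any reason not to do it.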
# Create a reproducible dataset
my_df <- data.frame(
  Location = rep(LETTERS[1:6], 1000/2),
  Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2)*10,
  RISK_MM = round(rnorm(1000), digits = 2)*100,
  Pressure = round(rnorm(1000), digits = 2)*1000,
  Sunshine = round(rnorm(1000), digits = 2))
# Add 0 values and NAs to the dataset for specific combinations of the categorical variables
my_df <- my_df %>%
  mutate(
    Temperatures = case_when(
      str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
      TRUE ~ as.numeric(as.character(Temperatures))))
# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)
# Print the results
print(my_df)
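- Some columns also have missing values. My plan is to impute them with the mice() function, using the categorical independent variables I am interested in, before running the PCA. For example: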
# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>%
  mice(m = 5, maxit = 50, method = "cart", seed = 123)
# Create a dataset after imputation
completeImputedData <- complete(imp, 1)
# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(is.numeric)
# Add a constant
completeImputedData_num_cs <- completeImputedData_num + 1
# Dimension reduction using PCA and scale the data.
my_pca <- prcomp(completeImputedData_num_cs, scale = TRUE, center = TRUE)
# Keep going...
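Do you think these approaches suit my needs? Or would you advise me to look into a different clustering method or another way of imputing the data?
Thank you for your attention, and have a nice day.
Best regards, Philippe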
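I don't think your zero values will affect your analysis. Let me explain. When you use PCA, it is important to scale the data, which you have already done. Did you notice what effect that has on your data?
Starting from your code; I added the libraries and set.seed().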
library(tidyverse)
library(mice)
# Create a reproducible dataset
set.seed(22123) # I added this to make this reproducible
my_df <- data.frame(
  Location = rep(LETTERS[1:6], 1000/2),
  Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2)*10,
  RISK_MM = round(rnorm(1000), digits = 2)*100,
  Pressure = round(rnorm(1000), digits = 2)*1000,
  Sunshine = round(rnorm(1000), digits = 2))
# Add 0 values and NAs to the dataset for specific combinations of the categorical variables
my_df <- my_df %>%
  mutate(
    Temperatures = case_when(
      str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
      TRUE ~ as.numeric(as.character(Temperatures))))
# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)
summary(my_df)
funModeling::df_status(my_df)
head(my_df)
# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>%
  mice(m = 5, maxit = 50, method = "cart", seed = 123,
       printFlag = FALSE) # I added this
# Create a dataset after imputation
completeImputedData <- complete(imp, 1)
# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(where(is.numeric)) # I added where()
# Add a constant
completeImputedData_num_cs <- completeImputedData_num + 1
# Dimension reduction using PCA and scale the data.
(my_pca <- prcomp(completeImputedData_num_cs,
                  scale = TRUE,
                  center = TRUE)) # I added the encapsulating parentheses
                                  # (print and create object simultaneously)
# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
#
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944
But what if you scale and center outside of the PCA call?
df <- scale(completeImputedData_num_cs)
(my_pca <- prcomp(df,
                  scale = FALSE, # added for clarity only
                  center = FALSE))
# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
#
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944
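If you would rather check this equivalence programmatically than compare two printouts by eye, something along these lines should work. It is only a sketch: `toy` is made-up data standing in for your numeric columns, and `pca_inside`, `pca_outside`, `same_sdev` and `same_rotation` are names I invented for the illustration.

```r
# Hypothetical toy data (NOT the data set from the question)
set.seed(1)
toy <- data.frame(a = rnorm(50),
                  b = rnorm(50, sd = 10),
                  c = rnorm(50, sd = 100))

# Centering and scaling done inside prcomp()
pca_inside <- prcomp(toy, center = TRUE, scale. = TRUE)

# Centering and scaling done beforehand with scale()
pca_outside <- prcomp(scale(toy), center = FALSE, scale. = FALSE)

# Compare standard deviations and loadings within numerical tolerance
same_sdev <- isTRUE(all.equal(pca_inside$sdev, pca_outside$sdev))
same_rotation <- isTRUE(all.equal(pca_inside$rotation, pca_outside$rotation))
```

Both flags should come out `TRUE`, because prcomp() applies the same centering and scaling internally before the decomposition.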
The PCA results are identical. And there are no zeros left: once you scale the data, a 0 is no longer 0. Take a look:
funModeling::df_status(completeImputedData_num_cs)
#       variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 Temperatures      12     0.4    0    0     0     0 numeric    383
# 2      RISK_MM      21     0.7    0    0     0     0 numeric    378
# 3     Pressure       0     0.0    0    0     0     0 numeric    376
# 4     Sunshine       9     0.3    0    0     0     0 numeric    364
funModeling::df_status(df)
#           variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 var.Temperatures       0       0    0    0     0     0 numeric    383
# 2      var.RISK_MM       0       0    0    0     0     0 numeric    378
# 3     var.Pressure       0       0    0    0     0     0 numeric    376
# 4     var.Sunshine       0       0    0    0     0     0 numeric    364
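On the idea of adding a constant: as far as I can tell it should make no difference once the data are centered, since centering subtracts each column's mean and the shift cancels out. A 0 is also an ordinary numeric value in R, not an NA. The sketch below illustrates both points on made-up data; `toy`, `n_missing`, `pca_raw`, `pca_shifted` and `shift_changes_nothing` are hypothetical names, and the loadings are compared in absolute value because the sign of a principal component is arbitrary.

```r
# Hypothetical toy data with genuine zeros (NOT the data set from the question)
set.seed(42)
toy <- data.frame(temp     = c(0, 0, rnorm(48, sd = 10)),
                  pressure = rnorm(50, sd = 1000),
                  sunshine = rnorm(50))

# Zeros are regular numbers: nothing here counts as missing
n_missing <- sum(is.na(toy))

# PCA on the raw data and on the data shifted by a constant
pca_raw     <- prcomp(toy,     center = TRUE, scale. = TRUE)
pca_shifted <- prcomp(toy + 1, center = TRUE, scale. = TRUE)

# Centering removes the shift, so both fits should agree
# (abs() guards against an arbitrary sign flip of a component)
shift_changes_nothing <-
  isTRUE(all.equal(pca_raw$sdev, pca_shifted$sdev)) &&
  isTRUE(all.equal(abs(pca_raw$rotation), abs(pca_shifted$rotation)))
```

If that holds for your data as well, the `+ 1` step can simply be dropped; it does no harm, but it does not change the PCA either.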