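I'm new to R and am currently trying to handle missing values. I have a dataset with several columns, and the "Purchase" column contains missing values.
I want to impute each missing value with the mean Purchase of the rows that share the same "Master_Category".
(Python code)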
# Boolean mask marking the rows whose Purchase value is missing
miss_Purch_rows = dataset['Purchase'].isnull()
# Mean Purchase for each group of the newly created Master_Product_Category column
categ_mean = dataset.groupby(['Master_Product_Category'])['Purchase'].mean()
# Impute mean Purchase value based on Master_Product_Category column
dataset.loc[miss_Purch_rows,'Purchase'] = dataset.loc[miss_Purch_rows,'Master_Product_Category'].apply(lambda x: categ_mean.loc[x])
I'm looking for similar code in R that imputes the mean, grouped by another column.
Sample data from the dataset is shown below:
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1 1000001 P00000142 F 0-17 10 0 345 13650
2 1000001 P00004842 F 0-17 10 0 3412 13645
3 1000001 P00025442 F 0-17 10 0 129 15416
4 1000001 P00051442 F 0-17 10 0 8170 9938
5 1000001 P00051842 F 0-17 10 0 480 2849
6 1000001 P00057542 F 0-17 10 0 345 NA
7 1000001 P00058142 F 0-17 10 0 3412 11051
8 1000001 P00058242 F 0-17 10 0 3412 NA
9 1000001 P00059442 F 0-17 10 0 6816 16622
10 1000001 P00064042 F 0-17 10 0 3412 8190
I tried:
with(dataset, sapply(X = Purchase, INDEX = Master_Category, FUN = mean, na.rm = TRUE))
but it doesn't seem to work.
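Per-group operations like this are usually easy with the tidyverse collection of packages.
First, we read in your sample data. Note that read.table parses a column containing only F as logical, which is why Gender prints as FALSE in the output further down; passing colClasses = c(Gender = "character") keeps it as text: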
txt <- 'User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1000001 P00000142 F 0-17 10 0 345 13650
1000001 P00004842 F 0-17 10 0 3412 13645
1000001 P00025442 F 0-17 10 0 129 15416
1000001 P00051442 F 0-17 10 0 8170 9938
1000001 P00051842 F 0-17 10 0 480 2849
1000001 P00057542 F 0-17 10 0 345 NA
1000001 P00058142 F 0-17 10 0 3412 11051
1000001 P00058242 F 0-17 10 0 3412 NA
1000001 P00059442 F 0-17 10 0 6816 16622
1000001 P00064042 F 0-17 10 0 3412 8190'
df <- read.table(text = txt, header = TRUE)
Then we group by Master_Category and, inside mutate, use ifelse to replace any NA value with the group mean:
library(tidyverse)
df.new <- df %>%
  group_by(Master_Category) %>%
  mutate(Purchase = ifelse(is.na(Purchase), mean(Purchase, na.rm = TRUE), Purchase))
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
<int> <fct> <lgl> <fct> <int> <int> <int> <dbl>
1 1000001 P00000142 FALSE 0-17 10 0 345 13650
2 1000001 P00004842 FALSE 0-17 10 0 3412 13645
3 1000001 P00025442 FALSE 0-17 10 0 129 15416
4 1000001 P00051442 FALSE 0-17 10 0 8170 9938
5 1000001 P00051842 FALSE 0-17 10 0 480 2849
6 1000001 P00057542 FALSE 0-17 10 0 345 13650
7 1000001 P00058142 FALSE 0-17 10 0 3412 11051
8 1000001 P00058242 FALSE 0-17 10 0 3412 10962
9 1000001 P00059442 FALSE 0-17 10 0 6816 16622
10 1000001 P00064042 FALSE 0-17 10 0 3412 8190
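As a cross-check on the imputed values above, the same per-group fill can be sketched in pandas, the library used by the Python snippet in the question. This is only a minimal sketch, not the asker's original code: it rebuilds just the Master_Category and Purchase columns from the sample data (assuming the other columns play no role in the imputation), and the names `df` and `group_mean` are illustrative.

```python
import numpy as np
import pandas as pd

# Only the two relevant columns of the sample data (assumption: the other
# columns do not affect the imputation).
df = pd.DataFrame({
    "Master_Category": [345, 3412, 129, 8170, 480, 345, 3412, 3412, 6816, 3412],
    "Purchase": [13650, 13645, 15416, 9938, 2849, np.nan, 11051, np.nan, 16622, 8190],
})

# Mean Purchase per Master_Category, broadcast back to every row.
# Missing values are skipped when the group mean is computed.
group_mean = df.groupby("Master_Category")["Purchase"].transform("mean")

# Replace only the missing Purchase values with their group's mean.
df["Purchase"] = df["Purchase"].fillna(group_mean)

print(df)
```

The two rows that were NA come out as 13650 (category 345) and 10962 (category 3412, the mean of 13645, 11051 and 8190), which agrees with the tidyverse result.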